<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Sphinx 0.9.10 reference manual</title><style type="text/css">
pre.programlisting
{
background-color: #f0f0f0;
padding: 0.5em;
margin-left: 2em;
margin-right: 2em;
}
</style><meta name="generator" content="DocBook XSL Stylesheets V1.70.1"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="article" lang="en"><div class="titlepage"><div><div><h1 class="title"><a name="id104153"></a>Sphinx 0.9.10 reference manual</h1></div><div><h3 class="subtitle"><i>Free open-source SQL full-text search engine</i></h3></div><div><p class="copyright">Copyright © 2001-2009 Andrew Aksyonoff, <code class="email"><<a href="mailto:shodan(at)shodan.ru">shodan(at)shodan.ru</a>></code></p></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="sect1"><a href="#intro">1. Introduction</a></span></dt><dd><dl><dt><span class="sect2"><a href="#about">1.1. About</a></span></dt><dt><span class="sect2"><a href="#features">1.2. Sphinx features</a></span></dt><dt><span class="sect2"><a href="#getting">1.3. Where to get Sphinx</a></span></dt><dt><span class="sect2"><a href="#license">1.4. License</a></span></dt><dt><span class="sect2"><a href="#author">1.5. Author and contributors</a></span></dt><dt><span class="sect2"><a href="#history">1.6. History</a></span></dt></dl></dd><dt><span class="sect1"><a href="#installation">2. Installation</a></span></dt><dd><dl><dt><span class="sect2"><a href="#supported-system">2.1. Supported systems</a></span></dt><dt><span class="sect2"><a href="#required-tools">2.2. Required tools</a></span></dt><dt><span class="sect2"><a href="#installing">2.3. Installing Sphinx on Linux</a></span></dt><dt><span class="sect2"><a href="#installing-windows">2.4. Installing Sphinx on Windows</a></span></dt><dt><span class="sect2"><a href="#install-problems">2.5. Known installation issues</a></span></dt><dt><span class="sect2"><a href="#quick-tour">2.6. Quick Sphinx usage tour</a></span></dt></dl></dd><dt><span class="sect1"><a href="#indexing">3. Indexing</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sources">3.1. 
Data sources</a></span></dt><dt><span class="sect2"><a href="#attributes">3.2. Attributes</a></span></dt><dt><span class="sect2"><a href="#mva">3.3. MVA (multi-valued attributes)</a></span></dt><dt><span class="sect2"><a href="#indexes">3.4. Indexes</a></span></dt><dt><span class="sect2"><a href="#data-restrictions">3.5. Restrictions on the source data</a></span></dt><dt><span class="sect2"><a href="#charsets">3.6. Charsets, case folding, and translation tables</a></span></dt><dt><span class="sect2"><a href="#sql">3.7. SQL data sources (MySQL, PostgreSQL)</a></span></dt><dt><span class="sect2"><a href="#xmlpipe">3.8. xmlpipe data source</a></span></dt><dt><span class="sect2"><a href="#xmlpipe2">3.9. xmlpipe2 data source</a></span></dt><dt><span class="sect2"><a href="#live-updates">3.10. Live index updates</a></span></dt><dt><span class="sect2"><a href="#index-merging">3.11. Index merging</a></span></dt></dl></dd><dt><span class="sect1"><a href="#searching">4. Searching</a></span></dt><dd><dl><dt><span class="sect2"><a href="#matching-modes">4.1. Matching modes</a></span></dt><dt><span class="sect2"><a href="#boolean-syntax">4.2. Boolean query syntax</a></span></dt><dt><span class="sect2"><a href="#extended-syntax">4.3. Extended query syntax</a></span></dt><dt><span class="sect2"><a href="#weighting">4.4. Weighting</a></span></dt><dt><span class="sect2"><a href="#sorting-modes">4.5. Sorting modes</a></span></dt><dt><span class="sect2"><a href="#clustering">4.6. Grouping (clustering) search results </a></span></dt><dt><span class="sect2"><a href="#distributed">4.7. Distributed searching</a></span></dt><dt><span class="sect2"><a href="#query-log-format">4.8. <code class="filename">searchd</code> query log format</a></span></dt><dt><span class="sect2"><a href="#sphinxql">4.9. MySQL protocol support and SphinxQL</a></span></dt><dt><span class="sect2"><a href="#multi-queries">4.10. 
Multi-queries</a></span></dt></dl></dd><dt><span class="sect1"><a href="#command-line-tools">5. Command line tools reference</a></span></dt><dd><dl><dt><span class="sect2"><a href="#ref-indexer">5.1. <code class="filename">indexer</code> command reference</a></span></dt><dt><span class="sect2"><a href="#ref-searchd">5.2. <code class="filename">searchd</code> command reference</a></span></dt><dt><span class="sect2"><a href="#ref-search">5.3. <code class="filename">search</code> command reference</a></span></dt><dt><span class="sect2"><a href="#ref-spelldump">5.4. <code class="filename">spelldump</code> command reference</a></span></dt><dt><span class="sect2"><a href="#ref-indextool">5.5. <code class="filename">indextool</code> command reference</a></span></dt></dl></dd><dt><span class="sect1"><a href="#api-reference">6. API reference</a></span></dt><dd><dl><dt><span class="sect2"><a href="#api-funcgroup-general">6.1. General API functions</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-getlasterror">6.1.1. GetLastError</a></span></dt><dt><span class="sect3"><a href="#api-func-getlastwarning">6.1.2. GetLastWarning</a></span></dt><dt><span class="sect3"><a href="#api-func-setserver">6.1.3. SetServer</a></span></dt><dt><span class="sect3"><a href="#api-func-setretries">6.1.4. SetRetries</a></span></dt><dt><span class="sect3"><a href="#api-func-setconnecttimeout">6.1.5. SetConnectTimeout</a></span></dt><dt><span class="sect3"><a href="#api-func-setarrayresult">6.1.6. SetArrayResult</a></span></dt><dt><span class="sect3"><a href="#api-func-isconnecterror">6.1.7. IsConnectError</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-general-query-settings">6.2. General query settings</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-setlimits">6.2.1. SetLimits</a></span></dt><dt><span class="sect3"><a href="#api-func-setmaxquerytime">6.2.2. 
SetMaxQueryTime</a></span></dt><dt><span class="sect3"><a href="#api-func-setoverride">6.2.3. SetOverride</a></span></dt><dt><span class="sect3"><a href="#api-func-setselect">6.2.4. SetSelect</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-fulltext-query-settings">6.3. Full-text search query settings</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-setmatchmode">6.3.1. SetMatchMode</a></span></dt><dt><span class="sect3"><a href="#api-func-setrankingmode">6.3.2. SetRankingMode</a></span></dt><dt><span class="sect3"><a href="#api-func-setsortmode">6.3.3. SetSortMode</a></span></dt><dt><span class="sect3"><a href="#api-func-setweights">6.3.4. SetWeights</a></span></dt><dt><span class="sect3"><a href="#api-func-setfieldweights">6.3.5. SetFieldWeights</a></span></dt><dt><span class="sect3"><a href="#api-func-setindexweights">6.3.6. SetIndexWeights</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-filtering">6.4. Result set filtering settings</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-setidrange">6.4.1. SetIDRange</a></span></dt><dt><span class="sect3"><a href="#api-func-setfilter">6.4.2. SetFilter</a></span></dt><dt><span class="sect3"><a href="#api-func-setfilterrange">6.4.3. SetFilterRange</a></span></dt><dt><span class="sect3"><a href="#api-func-setfilterfloatrange">6.4.4. SetFilterFloatRange</a></span></dt><dt><span class="sect3"><a href="#api-func-setgeoanchor">6.4.5. SetGeoAnchor</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-groupby">6.5. GROUP BY settings</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-setgroupby">6.5.1. SetGroupBy</a></span></dt><dt><span class="sect3"><a href="#api-func-setgroupdistinct">6.5.2. SetGroupDistinct</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-querying">6.6. Querying</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-query">6.6.1. 
Query</a></span></dt><dt><span class="sect3"><a href="#api-func-addquery">6.6.2. AddQuery</a></span></dt><dt><span class="sect3"><a href="#api-func-runqueries">6.6.3. RunQueries</a></span></dt><dt><span class="sect3"><a href="#api-func-resetfilters">6.6.4. ResetFilters</a></span></dt><dt><span class="sect3"><a href="#api-func-resetgroupby">6.6.5. ResetGroupBy</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-additional-functionality">6.7. Additional functionality</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-buildexcerpts">6.7.1. BuildExcerpts</a></span></dt><dt><span class="sect3"><a href="#api-func-updateatttributes">6.7.2. UpdateAttributes</a></span></dt><dt><span class="sect3"><a href="#api-func-buildkeywords">6.7.3. BuildKeywords</a></span></dt><dt><span class="sect3"><a href="#api-func-escapestring">6.7.4. EscapeString</a></span></dt><dt><span class="sect3"><a href="#api-func-status">6.7.5. Status</a></span></dt></dl></dd><dt><span class="sect2"><a href="#api-funcgroup-pconn">6.8. Persistent connections</a></span></dt><dd><dl><dt><span class="sect3"><a href="#api-func-open">6.8.1. Open</a></span></dt><dt><span class="sect3"><a href="#api-func-close">6.8.2. Close</a></span></dt></dl></dd></dl></dd><dt><span class="sect1"><a href="#sphinxse">7. MySQL storage engine (SphinxSE)</a></span></dt><dd><dl><dt><span class="sect2"><a href="#sphinxse-overview">7.1. SphinxSE overview</a></span></dt><dt><span class="sect2"><a href="#sphinxse-installing">7.2. Installing SphinxSE</a></span></dt><dd><dl><dt><span class="sect3"><a href="#sphinxse-mysql50">7.2.1. Compiling MySQL 5.0.x with SphinxSE</a></span></dt><dt><span class="sect3"><a href="#sphinxse-mysql51">7.2.2. Compiling MySQL 5.1.x with SphinxSE</a></span></dt><dt><span class="sect3"><a href="#sphinxse-checking">7.2.3. Checking SphinxSE installation</a></span></dt></dl></dd><dt><span class="sect2"><a href="#sphinxse-using">7.3. 
Using SphinxSE</a></span></dt><dt><span class="sect2"><a href="#sphinxse-snippets">7.4. Building snippets (excerpts) via MySQL</a></span></dt></dl></dd><dt><span class="sect1"><a href="#reporting-bugs">8. Reporting bugs</a></span></dt><dt><span class="sect1"><a href="#conf-reference">9. <code class="filename">sphinx.conf</code> options reference</a></span></dt><dd><dl><dt><span class="sect2"><a href="#confgroup-source">9.1. Data source configuration options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#conf-source-type">9.1.1. type</a></span></dt><dt><span class="sect3"><a href="#conf-sql-host">9.1.2. sql_host</a></span></dt><dt><span class="sect3"><a href="#conf-sql-port">9.1.3. sql_port</a></span></dt><dt><span class="sect3"><a href="#conf-sql-user">9.1.4. sql_user</a></span></dt><dt><span class="sect3"><a href="#conf-sql-pass">9.1.5. sql_pass</a></span></dt><dt><span class="sect3"><a href="#conf-sql-db">9.1.6. sql_db</a></span></dt><dt><span class="sect3"><a href="#conf-sql-sock">9.1.7. sql_sock</a></span></dt><dt><span class="sect3"><a href="#conf-mysql-connect-flags">9.1.8. mysql_connect_flags</a></span></dt><dt><span class="sect3"><a href="#conf-mysql-ssl">9.1.9. mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca</a></span></dt><dt><span class="sect3"><a href="#conf-odbc-dsn">9.1.10. odbc_dsn</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-pre">9.1.11. sql_query_pre</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query">9.1.12. sql_query</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-range">9.1.13. sql_query_range</a></span></dt><dt><span class="sect3"><a href="#conf-sql-range-step">9.1.14. sql_range_step</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-killlist">9.1.15. sql_query_killlist</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-uint">9.1.16. sql_attr_uint</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-bool">9.1.17. 
sql_attr_bool</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-bigint">9.1.18. sql_attr_bigint</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-timestamp">9.1.19. sql_attr_timestamp</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-str2ordinal">9.1.20. sql_attr_str2ordinal</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-float">9.1.21. sql_attr_float</a></span></dt><dt><span class="sect3"><a href="#conf-sql-attr-multi">9.1.22. sql_attr_multi</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-post">9.1.23. sql_query_post</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-post-index">9.1.24. sql_query_post_index</a></span></dt><dt><span class="sect3"><a href="#conf-sql-ranged-throttle">9.1.25. sql_ranged_throttle</a></span></dt><dt><span class="sect3"><a href="#conf-sql-query-info">9.1.26. sql_query_info</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-command">9.1.27. xmlpipe_command</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-field">9.1.28. xmlpipe_field</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-uint">9.1.29. xmlpipe_attr_uint</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-bool">9.1.30. xmlpipe_attr_bool</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-timestamp">9.1.31. xmlpipe_attr_timestamp</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-str2ordinal">9.1.32. xmlpipe_attr_str2ordinal</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-float">9.1.33. xmlpipe_attr_float</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-attr-multi">9.1.34. xmlpipe_attr_multi</a></span></dt><dt><span class="sect3"><a href="#conf-xmlpipe-fixup-utf8">9.1.35. xmlpipe_fixup_utf8</a></span></dt><dt><span class="sect3"><a href="#conf-mssql-winauth">9.1.36. mssql_winauth</a></span></dt><dt><span class="sect3"><a href="#conf-mssql-unicode">9.1.37. 
mssql_unicode</a></span></dt><dt><span class="sect3"><a href="#conf-unpack-zlib">9.1.38. unpack_zlib</a></span></dt><dt><span class="sect3"><a href="#conf-unpack-mysqlcompress">9.1.39. unpack_mysqlcompress</a></span></dt><dt><span class="sect3"><a href="#conf-unpack-mysqlcompress-maxsize">9.1.40. unpack_mysqlcompress_maxsize</a></span></dt></dl></dd><dt><span class="sect2"><a href="#confgroup-index">9.2. Index configuration options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#conf-index-type">9.2.1. type</a></span></dt><dt><span class="sect3"><a href="#conf-source">9.2.2. source</a></span></dt><dt><span class="sect3"><a href="#conf-path">9.2.3. path</a></span></dt><dt><span class="sect3"><a href="#conf-docinfo">9.2.4. docinfo</a></span></dt><dt><span class="sect3"><a href="#conf-mlock">9.2.5. mlock</a></span></dt><dt><span class="sect3"><a href="#conf-morphology">9.2.6. morphology</a></span></dt><dt><span class="sect3"><a href="#conf-min-stemming-len">9.2.7. min_stemming_len</a></span></dt><dt><span class="sect3"><a href="#conf-stopwords">9.2.8. stopwords</a></span></dt><dt><span class="sect3"><a href="#conf-wordforms">9.2.9. wordforms</a></span></dt><dt><span class="sect3"><a href="#conf-exceptions">9.2.10. exceptions</a></span></dt><dt><span class="sect3"><a href="#conf-min-word-len">9.2.11. min_word_len</a></span></dt><dt><span class="sect3"><a href="#conf-charset-type">9.2.12. charset_type</a></span></dt><dt><span class="sect3"><a href="#conf-charset-table">9.2.13. charset_table</a></span></dt><dt><span class="sect3"><a href="#conf-ignore-chars">9.2.14. ignore_chars</a></span></dt><dt><span class="sect3"><a href="#conf-min-prefix-len">9.2.15. min_prefix_len</a></span></dt><dt><span class="sect3"><a href="#conf-min-infix-len">9.2.16. min_infix_len</a></span></dt><dt><span class="sect3"><a href="#conf-prefix-fields">9.2.17. prefix_fields</a></span></dt><dt><span class="sect3"><a href="#conf-infix-fields">9.2.18. 
infix_fields</a></span></dt><dt><span class="sect3"><a href="#conf-enable-star">9.2.19. enable_star</a></span></dt><dt><span class="sect3"><a href="#conf-ngram-len">9.2.20. ngram_len</a></span></dt><dt><span class="sect3"><a href="#conf-ngram-chars">9.2.21. ngram_chars</a></span></dt><dt><span class="sect3"><a href="#conf-phrase-boundary">9.2.22. phrase_boundary</a></span></dt><dt><span class="sect3"><a href="#conf-phrase-boundary-step">9.2.23. phrase_boundary_step</a></span></dt><dt><span class="sect3"><a href="#conf-html-strip">9.2.24. html_strip</a></span></dt><dt><span class="sect3"><a href="#conf-html-index-attrs">9.2.25. html_index_attrs</a></span></dt><dt><span class="sect3"><a href="#conf-html-remove-elements">9.2.26. html_remove_elements</a></span></dt><dt><span class="sect3"><a href="#conf-local">9.2.27. local</a></span></dt><dt><span class="sect3"><a href="#conf-agent">9.2.28. agent</a></span></dt><dt><span class="sect3"><a href="#conf-agent-blackhole">9.2.29. agent_blackhole</a></span></dt><dt><span class="sect3"><a href="#conf-agent-connect-timeout">9.2.30. agent_connect_timeout</a></span></dt><dt><span class="sect3"><a href="#conf-agent-query-timeout">9.2.31. agent_query_timeout</a></span></dt><dt><span class="sect3"><a href="#conf-preopen">9.2.32. preopen</a></span></dt><dt><span class="sect3"><a href="#conf-ondisk-dict">9.2.33. ondisk_dict</a></span></dt><dt><span class="sect3"><a href="#conf-inplace-enable">9.2.34. inplace_enable</a></span></dt><dt><span class="sect3"><a href="#conf-inplace-hit-gap">9.2.35. inplace_hit_gap</a></span></dt><dt><span class="sect3"><a href="#conf-inplace-docinfo-gap">9.2.36. inplace_docinfo_gap</a></span></dt><dt><span class="sect3"><a href="#conf-inplace-reloc-factor">9.2.37. inplace_reloc_factor</a></span></dt><dt><span class="sect3"><a href="#conf-inplace-write-factor">9.2.38. inplace_write_factor</a></span></dt><dt><span class="sect3"><a href="#conf-index-exact-words">9.2.39. 
index_exact_words</a></span></dt><dt><span class="sect3"><a href="#conf-overshort-step">9.2.40. overshort_step</a></span></dt><dt><span class="sect3"><a href="#conf-stopword-step">9.2.41. stopword_step</a></span></dt></dl></dd><dt><span class="sect2"><a href="#confgroup-indexer">9.3. <code class="filename">indexer</code> program configuration options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#conf-mem-limit">9.3.1. mem_limit</a></span></dt><dt><span class="sect3"><a href="#conf-max-iops">9.3.2. max_iops</a></span></dt><dt><span class="sect3"><a href="#conf-max-iosize">9.3.3. max_iosize</a></span></dt><dt><span class="sect3"><a href="#conf-max-xmlpipe2-field">9.3.4. max_xmlpipe2_field</a></span></dt><dt><span class="sect3"><a href="#conf-write-buffer">9.3.5. write_buffer</a></span></dt></dl></dd><dt><span class="sect2"><a href="#confgroup-searchd">9.4. <code class="filename">searchd</code> program configuration options</a></span></dt><dd><dl><dt><span class="sect3"><a href="#conf-listen">9.4.1. listen</a></span></dt><dt><span class="sect3"><a href="#conf-address">9.4.2. address</a></span></dt><dt><span class="sect3"><a href="#conf-port">9.4.3. port</a></span></dt><dt><span class="sect3"><a href="#conf-log">9.4.4. log</a></span></dt><dt><span class="sect3"><a href="#conf-query-log">9.4.5. query_log</a></span></dt><dt><span class="sect3"><a href="#conf-read-timeout">9.4.6. read_timeout</a></span></dt><dt><span class="sect3"><a href="#conf-client-timeout">9.4.7. client_timeout</a></span></dt><dt><span class="sect3"><a href="#conf-max-children">9.4.8. max_children</a></span></dt><dt><span class="sect3"><a href="#conf-pid-file">9.4.9. pid_file</a></span></dt><dt><span class="sect3"><a href="#conf-max-matches">9.4.10. max_matches</a></span></dt><dt><span class="sect3"><a href="#conf-seamless-rotate">9.4.11. seamless_rotate</a></span></dt><dt><span class="sect3"><a href="#conf-preopen-indexes">9.4.12. 
preopen_indexes</a></span></dt><dt><span class="sect3"><a href="#conf-unlink-old">9.4.13. unlink_old</a></span></dt><dt><span class="sect3"><a href="#conf-attr-flush-period">9.4.14. attr_flush_period</a></span></dt><dt><span class="sect3"><a href="#conf-ondisk-dict-default">9.4.15. ondisk_dict_default</a></span></dt><dt><span class="sect3"><a href="#conf-max-packet-size">9.4.16. max_packet_size</a></span></dt><dt><span class="sect3"><a href="#conf-mva-updates-pool">9.4.17. mva_updates_pool</a></span></dt><dt><span class="sect3"><a href="#conf-crash-log-path">9.4.18. crash_log_path</a></span></dt><dt><span class="sect3"><a href="#conf-max-filters">9.4.19. max_filters</a></span></dt><dt><span class="sect3"><a href="#conf-max-filter-values">9.4.20. max_filter_values</a></span></dt><dt><span class="sect3"><a href="#conf-listen-backlog">9.4.21. listen_backlog</a></span></dt><dt><span class="sect3"><a href="#conf-read-buffer">9.4.22. read_buffer</a></span></dt><dt><span class="sect3"><a href="#conf-read-unhinted">9.4.23. read_unhinted</a></span></dt><dt><span class="sect3"><a href="#conf-max-batch-queries">9.4.24. max_batch_queries</a></span></dt><dt><span class="sect3"><a href="#conf-subtree-docs-cache">9.4.25. subtree_docs_cache</a></span></dt><dt><span class="sect3"><a href="#conf-subtree-hits-cache">9.4.26. subtree_hits_cache</a></span></dt></dl></dd></dl></dd><dt><span class="appendix"><a href="#changelog">A. Sphinx revision history</a></span></dt></dl></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro"></a>1. Introduction</h2></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="about"></a>1.1. About</h3></div></div></div><p>
Sphinx is a full-text search engine, distributed under GPL version 2.
Commercial licensing (e.g. for embedded use) is also available upon request.
</p><p>
Sphinx is a standalone search engine, meant to provide fast,
size-efficient, and relevant full-text search to other
applications. It was specifically designed to integrate well with
SQL databases and scripting languages.
</p><p>
The currently built-in data source drivers can fetch data either over
a direct connection to MySQL or PostgreSQL, or from a pipe in a custom XML
format. Adding new drivers (e.g. to natively support other DBMSes)
is designed to be as easy as possible.
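</p><p>
As an illustration of how the SQL drivers are wired up, a minimal <code class="filename">sphinx.conf</code> data source and index definition might look like the following sketch (the credentials, table, and index names here are hypothetical; see the configuration reference for the full option list):
</p>

```
source documents
{
    # built-in MySQL driver; "pgsql" and "xmlpipe2" are among the other built-in types
    type        = mysql
    sql_host    = localhost
    sql_user    = sphinx
    sql_pass    = secret
    sql_db      = test
    # the main fetch query; the first column must be the document ID
    sql_query   = SELECT id, title, content FROM documents
}

index documents
{
    source      = documents
    path        = /var/data/sphinx/documents
}
```

<p>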
</p><p>
The search API is natively ported to PHP, Python, Perl, Ruby, and Java, and is
also available as a pluggable MySQL storage engine. The API is very
lightweight, so porting it to a new language is known to take only a few hours.
</p><p>
As for the name, Sphinx is an acronym which is officially decoded
as SQL Phrase Index. Yes, I know about CMU's Sphinx project.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="features"></a>1.2. Sphinx features</h3></div></div></div><p>
</p><div class="itemizedlist"><ul type="disc"><li>high indexing speed (up to 10 MB/sec on modern CPUs);</li><li>high search speed (average query time is under 0.1 sec on 2-4 GB text collections);</li><li>high scalability (up to 100 GB of text and up to 100 million documents on a single CPU);</li><li>provides good relevance ranking through a combination of phrase proximity ranking and statistical (BM25) ranking;</li><li>provides distributed searching capabilities;</li><li>provides document excerpt generation;</li><li>provides searching from within MySQL through a pluggable storage engine;</li><li>supports boolean, phrase, and word proximity queries;</li><li>supports multiple full-text fields per document (up to 32 by default);</li><li>supports multiple additional attributes per document (i.e. groups, timestamps, etc.);</li><li>supports stopwords;</li><li>supports both single-byte encodings and UTF-8;</li><li>supports English stemming, Russian stemming, and Soundex for morphology;</li><li>supports MySQL natively (both MyISAM and InnoDB tables are supported);</li><li>supports PostgreSQL natively.</li></ul></div><p>
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="getting"></a>1.3. Where to get Sphinx</h3></div></div></div><p>Sphinx is available through its official Web site at <a href="http://www.sphinxsearch.com/" target="_top">http://www.sphinxsearch.com/</a>.
</p><p>Currently, the Sphinx distribution tarball includes the following software:
</p><div class="itemizedlist"><ul type="disc"><li><code class="filename">indexer</code>: a utility which creates full-text indexes;</li><li><code class="filename">search</code>: a simple command-line (CLI) test utility which searches through full-text indexes;</li><li><code class="filename">searchd</code>: a daemon which enables external software (e.g. Web applications) to search through full-text indexes;</li><li><code class="filename">sphinxapi</code>: a set of searchd client API libraries for popular Web scripting languages (PHP, Python, Perl, Ruby);</li><li><code class="filename">spelldump</code>: a simple command-line tool that extracts items from an <code class="filename">ispell</code> or <code class="filename">MySpell</code> (as bundled with OpenOffice) format dictionary, to help customize your index for use with <a href="#conf-wordforms" title="9.2.9. wordforms">wordforms</a>;</li><li><code class="filename">indextool</code>: a utility to dump miscellaneous debug information about the index, added in version 0.9.9-rc2.</li></ul></div><p>
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="license"></a>1.4. License</h3></div></div></div><p>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License,
or (at your option) any later version. See COPYING file for details.
</p><p>
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
</p><p>
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
</p><p>
If you don't want to be bound by GNU GPL terms (for instance,
if you would like to embed Sphinx in your software, but would not
like to disclose its source code), please contact
<a href="#author" title="1.5. Author and contributors">the author</a> to obtain
a commercial license.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="author"></a>1.5. Author and contributors</h3></div></div></div><h4><a name="id353649"></a>Author</h4><p>
The initial author and current primary developer of Sphinx is:
</p><div class="itemizedlist"><ul type="disc"><li>Andrew Aksyonoff, <code class="email"><<a href="mailto:shodan(at)shodan.ru">shodan(at)shodan.ru</a>></code></li></ul></div><p>
</p><h4><a name="id347603"></a>Contributors</h4><p>People who contributed to Sphinx and their contributions (in no particular order) are:
</p><div class="itemizedlist"><ul type="disc"><li>Robert "coredev" Bengtsson (Sweden), initial version of the PostgreSQL data source;</li><li>Len Kranendonk, Perl API;</li><li>Dmytro Shteflyuk, Ruby API.</li></ul></div><p>
</p><p>
Many other people have contributed ideas, bug reports, fixes, etc.
Thank you!
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="history"></a>1.6. History</h3></div></div></div><p>
Sphinx development was started back in 2001, because I didn't manage
to find an acceptable search solution (for a database driven Web site)
which would meet my requirements. Actually, each and every important aspect was a problem:
</p><div class="itemizedlist"><ul type="disc"><li>search quality (ie. good relevance)
<div class="itemizedlist"><ul type="circle"><li>statistical ranking methods performed rather bad, especially on large collections of small documents (forums, blogs, etc)</li></ul></div></li><li>search speed
<div class="itemizedlist"><ul type="circle"><li>especially if searching for phrases which contain stopwords, as in "to be or not to be"</li></ul></div></li><li>moderate disk and CPU requirements when indexing
<div class="itemizedlist"><ul type="circle"><li>important in a shared hosting environment, not to mention the indexing speed.</li></ul></div></li></ul></div><p>
</p><p>
Despite the amount of time that has passed and the numerous improvements
made in the other solutions, there's still no solution which I personally
would be eager to migrate to.
</p><p>
Considering that, and all the positive feedback received from Sphinx users
over the years, the obvious decision is to continue developing Sphinx
(and, eventually, to take over the world).
</p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="installation"></a>2. Installation</h2></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="supported-system"></a>2.1. Supported systems</h3></div></div></div><p>
Most modern UNIX systems with a C++ compiler should be able
to compile and run Sphinx without any modifications.
</p><p>
Systems on which Sphinx is currently known to run successfully include:
</p><div class="itemizedlist"><ul type="disc"><li>Linux 2.4.x, 2.6.x (various distributions)</li><li>Windows 2000, XP</li><li>FreeBSD 4.x, 5.x, 6.x</li><li>NetBSD 1.6, 3.0</li><li>Solaris 9, 11</li><li>Mac OS X</li></ul></div><p>
</p><p>
CPU architectures known to work include X86, X86-64, SPARC64.
</p><p>
I hope Sphinx will work on other Unix platforms as well.
If the platform you run Sphinx on is not in this list,
please do report it.
</p><p>
At the moment, the Windows version of Sphinx is not intended for production
use, but rather for testing and debugging only. The two most prominent
issues are missing concurrent query support (client queries are queued
at the TCP connection level instead) and missing index data rotation support.
There are successful production installations which work around these issues.
However, running a high-volume search service under Windows
is still not recommended.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="required-tools"></a>2.2. Required tools</h3></div></div></div><p>
On UNIX, you will need the following tools to build
and install Sphinx:
</p><div class="itemizedlist"><ul type="disc"><li>a working C++ compiler. GNU gcc is known to work.</li><li>a good make program. GNU make is known to work.</li></ul></div><p>
</p><p>
On Windows, you will need Microsoft Visual C/C++ Studio .NET 2003 or 2005.
Other compilers/environments will probably work as well, but for the
time being, you will have to build makefile (or other environment
specific project files) manually.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="installing"></a>2.3. Installing Sphinx on Linux</h3></div></div></div><div class="orderedlist"><ol type="1"><li><p>
Extract everything from the distribution tarball (haven't you already?)
and go to the <code class="filename">sphinx</code> subdirectory:
</p><p><strong class="userinput"><code><div class="literallayout"><p>$ tar xzvf sphinx-0.9.8.tar.gz<br>
$ cd sphinx<br>
</p></div></code></strong></p></li><li><p>Run the configuration program:</p><p><strong class="userinput"><code><div class="literallayout"><p>$ ./configure</p></div></code></strong></p><p>
There are a number of options to configure. The complete listing may
be obtained by using the <code class="option">--help</code> switch. The most important ones are:
</p><div class="itemizedlist"><ul type="disc"><li><code class="option">--prefix</code>, which specifies where to install Sphinx; such as <code class="option">--prefix=/usr/local/sphinx</code> (all of the examples use this prefix)</li><li><code class="option">--with-mysql</code>, which specifies where to look for MySQL
include and library files, if auto-detection fails;</li><li><code class="option">--with-pgsql</code>, which specifies where to look for PostgreSQL
include and library files.</li></ul></div><p>
</p></li><li><p>Build the binaries:</p><p><strong class="userinput"><code><div class="literallayout"><p>$ make</p></div></code></strong></p></li><li><p>Install the binaries in the directory of your choice: (defaults to <code class="filename">/usr/local/bin/</code> on *nix systems, but is overridden with <code class="option">configure --prefix</code>)</p><p><strong class="userinput"><code><div class="literallayout"><p>$ make install</p></div></code></strong></p></li></ol></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="installing-windows"></a>2.4. Installing Sphinx on Windows</h3></div></div></div><p>Installing Sphinx on a Windows server is often easier than installing on a Linux environment; unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads area on the website.</p><div class="orderedlist"><ol type="1"><li><p>Extract everything from the .zip file you have downloaded - <code class="filename">sphinx-0.9.8-win32.zip</code> (or <code class="filename">sphinx-0.9.8-win32-pgsql.zip</code> if you need PostgreSQL support as well.) You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive.</p><p>For the remainder of this guide, we will assume that the folders are unzipped into <code class="filename">C:\Sphinx</code>, such that <code class="filename">searchd.exe</code> can be found in <code class="filename">C:\Sphinx\bin\searchd.exe</code>. 
If you decide to use any different location for the folders or configuration file, please change it accordingly.</p></li><li><p>Edit the contents of sphinx.conf.in - specifically entries relating to @CONFDIR@ - to paths suitable for your system.</p></li><li><p>Install the <code class="filename">searchd</code> system as a Windows service:</p><p><strong class="userinput"><code>C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch</code></strong></p></li><li><p>The <code class="filename">searchd</code> service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with <code class="filename">indexer</code> before starting the service. A guide to do this can be found under <a href="#quick-tour" title="2.6. Quick Sphinx usage tour">Quick tour</a>.</p><p>During the next steps of the install (which involve running indexer pretty much as you would on Linux) you may find that you get an error relating to libmysql.dll not being found. If you have MySQL installed, you should find a copy of this library in your Windows directory, or sometimes in Windows\System32, or failing that in the MySQL core directories. If you do receive an error please copy libmysql.dll into the bin directory.</p></li></ol></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="install-problems"></a>2.5. Known installation issues</h3></div></div></div><p>
If <code class="filename">configure</code> fails to locate MySQL headers and/or libraries,
try checking for and installing <code class="filename">mysql-devel</code> package. On some systems,
it is not installed by default.
</p><p>
If <code class="filename">make</code> fails with a message which looks like
</p><pre class="programlisting">
/bin/sh: g++: command not found
make[1]: *** [libsphinx_a-sphinx.o] Error 127
</pre><p>
try checking for and installing <code class="filename">gcc-c++</code> package.
</p><p>
If you are getting compile-time errors which look like
</p><pre class="programlisting">
sphinx.cpp:67: error: invalid application of `sizeof' to
incomplete type `Private::SizeError<false>'
</pre><p>
this means that some compile-time type size check failed.
The most probable reason is that the off_t type is less than 64 bits wide
on your system. As a quick hack, you can edit sphinx.h and replace off_t
with DWORD in the typedef for SphOffset_t, but note that this will prohibit
you from using full-text indexes larger than 2 GB. Even if the hack helps,
please report such issues, providing the exact error message and
compiler/OS details, so that I can properly fix them in future releases.
</p><p>
If you are getting any other error, or the suggestions above
do not seem to help you, please don't hesitate to contact me.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="quick-tour"></a>2.6. Quick Sphinx usage tour</h3></div></div></div><p>
All the example commands below assume that you installed Sphinx
in <code class="filename">/usr/local/sphinx</code>, so <code class="filename">searchd</code> can
be found in <code class="filename">/usr/local/sphinx/bin/searchd</code>.
</p><p>
To use Sphinx, you will need to:
</p><div class="orderedlist"><ol type="1"><li><p>Create a configuration file.</p><p>
Default configuration file name is <code class="filename">sphinx.conf</code>.
All Sphinx programs look for this file in current working directory
by default.
</p><p>
Sample configuration file, <code class="filename">sphinx.conf.dist</code>, which has
all the options documented, is created by <code class="filename">configure</code>.
Copy and edit that sample file to make your own configuration: (assuming Sphinx is installed into <code class="filename">/usr/local/sphinx/</code>)
</p><p><strong class="userinput"><code><div class="literallayout"><p>$ cd /usr/local/sphinx/etc<br>
$ cp sphinx.conf.dist sphinx.conf<br>
$ vi sphinx.conf</p></div></code></strong></p><p>
The sample configuration file is set up to index the <code class="filename">documents</code>
table from the MySQL database <code class="filename">test</code>; the <code class="filename">example.sql</code>
sample data file can be used to populate that table with a few documents for testing purposes:
</p><p><strong class="userinput"><code><div class="literallayout"><p>$ mysql -u test < /usr/local/sphinx/etc/example.sql</p></div></code></strong></p></li><li><p>Run the indexer to create full-text index from your data:</p><p><strong class="userinput"><code><div class="literallayout"><p>$ cd /usr/local/sphinx/etc<br>
$ /usr/local/sphinx/bin/indexer --all</p></div></code></strong></p></li><li><p>Query your newly created index!</p></li></ol></div><p>
To query the index from command line, use <code class="filename">search</code> utility:
</p><p><strong class="userinput"><code><div class="literallayout"><p>$ cd /usr/local/sphinx/etc<br>
$ /usr/local/sphinx/bin/search test</p></div></code></strong></p><p>
To query the index from your PHP scripts, you need to:
</p><div class="orderedlist"><ol type="1"><li><p>Run the search daemon which your script will talk to:</p><p><strong class="userinput"><code><div class="literallayout"><p>$ cd /usr/local/sphinx/etc<br>
$ /usr/local/sphinx/bin/searchd</p></div></code></strong></p></li><li><p>
Run the attached PHP API test script (to ensure that the daemon
was successfully started and is ready to serve queries):
</p><p><strong class="userinput"><code><div class="literallayout"><p>$ cd sphinx/api<br>
$ php test.php test</p></div></code></strong></p></li><li><p>
Include the API (it's located in <code class="filename">api/sphinxapi.php</code>)
into your own scripts and use it.
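</p><p>
A minimal script using the API might look as follows. This is a sketch, not part of the distribution: the host, port, and result handling are assumptions which you should adjust to match your own searchd settings and indexes.
</p><pre class="programlisting">
<?php
require ( "sphinxapi.php" );

$cl = new SphinxClient ();
$cl->SetServer ( "localhost", 3312 ); // host/port are assumptions; use your searchd settings
$res = $cl->Query ( "test", "*" ); // search all indexes for "test"

if ( $res===false )
	die ( "query failed: " . $cl->GetLastError () . "\n" );

foreach ( $res["matches"] as $docid=>$match )
	print "found document id=$docid, weight={$match['weight']}\n";
?>
</pre><p>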
</p></li></ol></div><p>
Happy searching!
</p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="indexing"></a>3. Indexing</h2></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sources"></a>3.1. Data sources</h3></div></div></div><p>
The data to be indexed can generally come from very different
sources: SQL databases, plain text files, HTML files, mailboxes,
and so on. From Sphinx's point of view, the data it indexes is a
set of structured <em class="glossterm">documents</em>, each of which has the
same set of <em class="glossterm">fields</em>. This is biased towards SQL, where
each row corresponds to a document, and each column to a field.
</p><p>
Depending on what source Sphinx should get the data from,
different code is required to fetch the data and prepare it for indexing.
This code is called <em class="glossterm">data source driver</em> (or simply
<em class="glossterm">driver</em> or <em class="glossterm">data source</em> for brevity).
</p><p>
At the time of this writing, there are drivers for MySQL and
PostgreSQL databases, which can connect to the database using
its native C/C++ API, run queries and fetch the data. There's
also a driver called xmlpipe, which runs a specified command
and reads the data from its <code class="filename">stdout</code>.
See <a href="#xmlpipe" title="3.8. xmlpipe data source">Section 3.8, “xmlpipe data source”</a> section for the format description.
</p><p>
There can be as many sources per index as necessary. They will be
sequentially processed in the very same order which was specified in
index definition. All the documents coming from those sources
will be merged as if they were coming from a single source.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="attributes"></a>3.2. Attributes</h3></div></div></div><p>
Attributes are additional values associated with each document
that can be used to perform additional filtering and sorting during search.
</p><p>
It is often desired to additionally process full-text search results
based not only on matching document ID and its rank, but on a number
of other per-document values as well. For instance, one might need to
sort news search results by date and then relevance,
or search through products within specified price range,
or limit blog search to posts made by selected users,
or group results by month. To do that efficiently, Sphinx allows you
to attach a number of additional <em class="glossterm">attributes</em>
to each document, and store their values in the full-text index.
It's then possible to use stored values to filter, sort,
or group full-text matches.
</p><p>Attributes, unlike the fields, are not full-text indexed. They
are stored in the index, but it is not possible to search them as full-text,
and attempting to do so results in an error.</p><p>For example, it is impossible to use the extended matching mode expression
<code class="option">@column 1</code> to match documents where column is 1, if column is an
attribute; this is true even though numeric digits are normally indexed.</p><p>Attributes can be used for filtering, though, to restrict returned
rows, as well as sorting or <a href="#clustering" title="4.6. Grouping (clustering) search results ">result grouping</a>;
it is entirely possible to sort results purely based on attributes, and ignore the search
relevance tools. Additionally, attributes are returned from the search daemon, while the
indexed text is not.</p><p>
A good example for attributes would be a forum posts table. Assume
that only title and content fields need to be full-text searchable -
but that sometimes it is also required to limit search to a certain
author or a sub-forum (ie. search only those rows that have some
specific values of author_id or forum_id columns in the SQL table);
or to sort matches by post_date column; or to group matching posts
by month of the post_date and calculate per-group match counts.
</p><p>
This can be achieved by specifying all the mentioned columns
(excluding title and content, that are full-text fields) as
attributes, indexing them, and then using API calls to
set up filtering, sorting, and grouping. Here is an example.
</p><h4><a name="id361050"></a>Example sphinx.conf part:</h4><p>
</p><pre class="programlisting">
...
sql_query = SELECT id, title, content, \
author_id, forum_id, post_date FROM my_forum_posts
sql_attr_uint = author_id
sql_attr_uint = forum_id
sql_attr_timestamp = post_date
...
</pre><p>
</p><h4><a name="id359652"></a>Example application code (in PHP):</h4><p>
</p><pre class="programlisting">
// only search posts by author whose ID is 123
$cl->SetFilter ( "author_id", array ( 123 ) );
// only search posts in sub-forums 1, 3 and 7
$cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
// sort found posts by posting date in descending order
$cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
</pre><p>
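</p><p>
The month grouping mentioned earlier can be set up the same way. This is a sketch; the group sorting clause ("@group desc") is chosen arbitrarily for illustration:
</p><pre class="programlisting">
// group matching posts by month of post_date,
// most recent groups first; per-group match counts
// will be returned in the @count attribute
$cl->SetGroupBy ( "post_date", SPH_GROUPBY_MONTH, "@group desc" );
</pre><p>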
</p><p>
Attributes are named. Attribute names are case insensitive.
Attributes are <span class="emphasis"><em>not</em></span> full-text indexed; they are stored in the index as is.
Currently supported attribute types are:
</p><div class="itemizedlist"><ul type="disc"><li>unsigned integers (1-bit to 32-bit wide);</li><li>UNIX timestamps;</li><li>floating point values (32-bit, IEEE 754 single precision);</li><li>string ordinals (specially computed integers);</li><li><a href="#mva" title="3.3. MVA (multi-valued attributes)">MVA</a>, multi-value attributes (variable-length lists of 32-bit unsigned integers).</li></ul></div><p>
</p><p>
The complete set of per-document attribute values is sometimes
referred to as <em class="glossterm">docinfo</em>. Docinfos can either be
</p><div class="itemizedlist"><ul type="disc"><li>stored separately from the main full-text index data ("extern" storage, in <code class="filename">.spa</code> file), or</li><li>attached to each occurrence of document ID in full-text index data ("inline" storage, in <code class="filename">.spd</code> file).</li></ul></div><p>
</p><p>
When using extern storage, a copy of <code class="filename">.spa</code> file
(with all the attribute values for all the documents) is kept in RAM by
<code class="filename">searchd</code> at all times. This is for performance reasons;
random disk I/O would be too slow. In contrast, inline storage does not
require any additional RAM at all, but that comes at the cost of greatly
inflating the index size: remember that it copies <span class="emphasis"><em>all</em></span>
attribute values <span class="emphasis"><em>every</em></span> time the document ID
is mentioned, and that is exactly as many times as there are
different keywords in the document. Inline storage may be the only viable
option if you have only a few attributes and need to work with big
datasets in limited RAM. However, in most cases extern storage
makes both indexing and searching <span class="emphasis"><em>much</em></span> more efficient.
</p><p>
Search-time memory requirements for extern storage are
(1+number_of_attrs)*number_of_docs*4 bytes, ie. 10 million docs with
2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM.
This is <span class="emphasis"><em>PER DAEMON</em></span>, not per query. <code class="filename">searchd</code>
will allocate 160 MB on startup, read the data and keep it shared between queries.
The children will <span class="emphasis"><em>NOT</em></span> allocate any additional
copies of this data.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="mva"></a>3.3. MVA (multi-valued attributes)</h3></div></div></div><p>
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx.
MVAs make it possible to attach lists of values to every document.
They are useful for article tags, product categories, etc.
Filtering and group-by (but not sorting) on MVA attributes is supported.
</p><p>
Currently, MVA list entries are limited to unsigned 32-bit integers.
The list length is not limited: you can have an arbitrary number of values
attached to each document as long as RAM permits (<code class="filename">.spm</code> file
that contains the MVA values will be precached in RAM by <code class="filename">searchd</code>).
The source data can be taken either from a separate query, or from a document field;
see source type in <a href="#conf-sql-attr-multi" title="9.1.22. sql_attr_multi">sql_attr_multi</a>.
In the first case the query will have to return pairs of document ID and MVA values,
in the second one the field will be parsed for integer values.
There are absolutely no requirements as to incoming data order; the values will be
automatically grouped by document ID (and internally sorted within the same ID)
during indexing anyway.
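</p><p>
For instance, a 'tag' MVA could be declared either way as follows. This is a sketch; the tags table and column names are made up for illustration:
</p><pre class="programlisting">
# take MVA values from a separate query returning (document ID, value) pairs
sql_attr_multi = uint tag from query; SELECT id, tag FROM tags

# or parse integer values out of a document field named 'tag'
# sql_attr_multi = uint tag from field
</pre><p>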
</p><p>
When filtering, a document will match the filter on MVA attribute
if <span class="emphasis"><em>any</em></span> of the values satisfy the filtering condition.
(Therefore, documents that pass through exclude filters will not
contain any of the forbidden values.)
When grouping by MVA attribute, a document will contribute to as
many groups as there are different MVA values associated with that document.
For instance, if the collection contains exactly 1 document having a 'tag' MVA
with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with
'@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively.
Also note that grouping by MVA might lead to duplicate documents in the result set:
because each document can participate in many groups, it can be chosen as the best
one in more than one group, leading to duplicate IDs. The PHP API historically
uses ordered hash on the document ID for the resulting rows; so you'll also need to use
<a href="#api-func-setarrayresult" title="6.1.6. SetArrayResult">SetArrayResult()</a> in order
to employ group-by on MVA with PHP API.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="indexes"></a>3.4. Indexes</h3></div></div></div><p>
To be able to answer full-text search queries fast, Sphinx needs
to build a special data structure optimized for such queries from
your text data. This structure is called <em class="glossterm">index</em>; and
the process of building index from text is called <em class="glossterm">indexing</em>.
</p><p>
Different index types are well suited for different tasks.
For example, an on-disk tree-based index would be easy to
update (ie. to insert new documents into an existing index), but rather
slow to search. Therefore, Sphinx architecture allows for different
<em class="glossterm">index types</em> to be implemented easily.
</p><p>
The only index type which is implemented in Sphinx at the moment is
designed for maximum indexing and searching speed. This comes at the cost
of updates being really slow; theoretically, it might be slower to
update this type of index than to reindex it from scratch.
However, this can very frequently be worked around with
multiple indexes; see <a href="#live-updates" title="3.10. Live index updates">Section 3.10, “Live index updates”</a> for details.
</p><p>
It is planned to implement more index types, including the
type which would be updateable in real time.
</p><p>
There can be as many indexes per configuration file as necessary.
<code class="filename">indexer</code> utility can reindex either all of them
(if <code class="option">--all</code> option is specified), or a certain explicitly
specified subset. <code class="filename">searchd</code> utility will serve all
the specified indexes, and the clients can specify what indexes to
search at run time.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="data-restrictions"></a>3.5. Restrictions on the source data</h3></div></div></div><p>
There are a few different restrictions imposed on the source data
which is going to be indexed by Sphinx, of which the single most
important one is:
</p><p><span class="bold"><strong>
ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).
</strong></span></p><p>
If this requirement is not met, different bad things can happen.
For instance, Sphinx can crash with an internal assertion while indexing;
or produce strange results when searching due to conflicting IDs.
Also, a 1000-pound gorilla might eventually come out of your
display and start throwing barrels at you. You've been warned.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="charsets"></a>3.6. Charsets, case folding, and translation tables</h3></div></div></div><p>
When indexing, Sphinx fetches documents from
the specified sources, splits the text into words, and does
case folding so that "Abc", "ABC" and "abc" would be treated
as the same word (or, to be pedantic, <em class="glossterm">term</em>).
</p><p>
To do that properly, Sphinx needs to know
</p><div class="itemizedlist"><ul type="disc"><li>what encoding is the source text in;</li><li>what characters are letters and what are not;</li><li>what letters should be folded to what letters.</li></ul></div><p>
This should be configured on a per-index basis using
<code class="option"><a href="#conf-charset-type" title="9.2.12. charset_type">charset_type</a></code> and
<code class="option"><a href="#conf-charset-table" title="9.2.13. charset_table">charset_table</a></code> options.
<code class="option"><a href="#conf-charset-type" title="9.2.12. charset_type">charset_type</a></code>
specifies whether the document encoding is single-byte (SBCS) or UTF-8.
<code class="option"><a href="#conf-charset-table" title="9.2.13. charset_table">charset_table</a></code>
specifies the table that maps letter characters to their case
folded versions. The characters that are not in the table are considered
to be non-letters and will be treated as word separators when indexing
or searching through this index.
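</p><p>
For example, a UTF-8 index that treats digits, English letters, and the underscore as letters (folding uppercase to lowercase) could use the following settings. This is a sketch of the two relevant options only:
</p><pre class="programlisting">
index example
{
	# ...
	charset_type  = utf-8
	charset_table = 0..9, A..Z->a..z, _, a..z
}
</pre><p>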
</p><p>
Note that while default tables do not include space character
(ASCII code 0x20, Unicode U+0020) as a letter, it's in fact
<span class="emphasis"><em>perfectly legal</em></span> to do so. This can be
useful, for instance, for indexing tag clouds, so that space-separated
word sets would index as a <span class="emphasis"><em>single</em></span> search query term.
</p><p>
Default tables currently include English and Russian characters.
Please do submit your tables for other languages!
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sql"></a>3.7. SQL data sources (MySQL, PostgreSQL)</h3></div></div></div><p>
With all the SQL drivers, indexing generally works as follows.
</p><div class="itemizedlist"><ul type="disc"><li>connection to the database is established;</li><li>pre-query (see <a href="#conf-sql-query-pre" title="9.1.11. sql_query_pre">Section 9.1.11, “sql_query_pre”</a>) is executed
to perform any necessary initial setup, such as setting per-connection encoding with MySQL;</li><li>main query (see <a href="#conf-sql-query" title="9.1.12. sql_query">Section 9.1.12, “sql_query”</a>) is executed and the rows it returns are indexed;</li><li>post-query (see <a href="#conf-sql-query-post" title="9.1.23. sql_query_post">Section 9.1.23, “sql_query_post”</a>) is executed
to perform any necessary cleanup;</li><li>connection to the database is closed;</li><li>indexer does the sorting phase (to be pedantic, index-type specific post-processing);</li><li>connection to the database is established again;</li><li>post-index query (see <a href="#conf-sql-query-post-index" title="9.1.24. sql_query_post_index">Section 9.1.24, “sql_query_post_index”</a>) is executed
to perform any necessary final cleanup;</li><li>connection to the database is closed again.</li></ul></div><p>
Most options, such as database user/host/password, are straightforward.
However, there are a few subtle things, which are discussed in more detail here.
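</p><p>
Putting the steps above together, a minimal MySQL source might look like this. This is a sketch: the credentials, table, and pre/post statements are placeholders to adapt to your schema.
</p><pre class="programlisting">
source example_src
{
	type     = mysql

	sql_host = localhost
	sql_user = test
	sql_pass =
	sql_db   = test

	# per-connection setup, eg. encoding
	sql_query_pre  = SET NAMES utf8

	# main query that fetches the documents
	sql_query      = SELECT id, title, content FROM documents

	# cleanup, run once all the rows were received
	sql_query_post = DROP TABLE IF EXISTS my_tmp_table
}
</pre><p>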
</p><h4><a name="ranged-queries"></a>Ranged queries</h4><p>
Main query, which needs to fetch all the documents, can impose
a read lock on the whole table and stall the concurrent queries
(eg. INSERTs into a MyISAM table), waste a lot of memory on the result set, etc.
To avoid this, Sphinx supports so-called <em class="glossterm">ranged queries</em>.
With ranged queries, Sphinx first fetches min and max document IDs from
the table, and then substitutes different ID intervals into main query text
and runs the modified query to fetch another chunk of documents.
Here's an example.
</p><div class="example"><a name="ex-ranged-queries"></a><p class="title"><b>Example 1. Ranged query usage example</b></p><div class="example-contents"><pre class="programlisting">
# in sphinx.conf
sql_query_range = SELECT MIN(id),MAX(id) FROM documents
sql_range_step = 1000
sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end
</pre></div></div><br class="example-break"><p>
If the table contains document IDs from 1 to, say, 2345, then sql_query would
be run three times:
</p><div class="orderedlist"><ol type="1"><li>with <code class="option">$start</code> replaced with 1 and <code class="option">$end</code> replaced with 1000;</li><li>with <code class="option">$start</code> replaced with 1001 and <code class="option">$end</code> replaced with 2000;</li><li>with <code class="option">$start</code> replaced with 2001 and <code class="option">$end</code> replaced with 2345.</li></ol></div><p>
Obviously, that's not much of a difference for a 2000-row table,
but when it comes to indexing a 10-million-row MyISAM table,
ranged queries might be of some help.
</p><h4><a name="id361616"></a><code class="option">sql_query_post</code> vs. <code class="option">sql_query_post_index</code></h4><p>
The difference between the post-query and the post-index query is that the post-query
is run immediately after Sphinx has received all the documents, while further indexing
<span class="bold"><strong>may</strong></span> still fail for some other reason. By the time
the post-index query gets executed, on the other hand, it is <span class="bold"><strong>guaranteed</strong></span>
that the indexing was successful. The database connection is dropped and re-established
because the sorting phase can be very lengthy and the connection would otherwise time out.
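</p><p>
That guarantee makes the post-index query the natural place to record indexing progress; the $maxid macro expands there to the maximum document ID which was actually fetched. The counters table below is a made-up example:
</p><pre class="programlisting">
sql_query_post_index = REPLACE INTO sphinx_counters \
	VALUES ( 'max_indexed_id', $maxid )
</pre><p>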
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="xmlpipe"></a>3.8. xmlpipe data source</h3></div></div></div><p>
xmlpipe data source was designed to enable users to plug data into
Sphinx without having to implement new data source drivers themselves.
It is limited to 2 fixed fields and 2 fixed attributes, and is deprecated
in favor of <a href="#xmlpipe2" title="3.9. xmlpipe2 data source">Section 3.9, “xmlpipe2 data source”</a> now. For new streams, use xmlpipe2.
</p><p>
To use xmlpipe, configure the data source in your configuration file
as follows:
</p><pre class="programlisting">
source example_xmlpipe_source
{
type = xmlpipe
xmlpipe_command = perl /www/mysite.com/bin/sphinxpipe.pl
}
</pre><p>
The <code class="filename">indexer</code> will run the command specified
in <code class="option"><a href="#conf-xmlpipe-command" title="9.1.27. xmlpipe_command">xmlpipe_command</a></code>,
and then read, parse and index the data it prints to <code class="filename">stdout</code>.
More formally, it opens a pipe to given command and then reads
from that pipe.
</p><p>
indexer expects one or more documents in a custom XML format.
Here's an example document stream, consisting of two documents:
</p><div class="example"><a name="ex-xmlpipe-document"></a><p class="title"><b>Example 2. XMLpipe document stream</b></p><div class="example-contents"><pre class="programlisting">
<document>
<id>123</id>
<group>45</group>
<timestamp>1132223498</timestamp>
<title>test title</title>
<body>
this is my document body
</body>
</document>
<document>
<id>124</id>
<group>46</group>
<timestamp>1132223498</timestamp>
<title>another test</title>
<body>
this is another document
</body>
</document>
</pre></div></div><p><br class="example-break">
</p><p>
The legacy xmlpipe driver uses a built-in parser
which is pretty fast but really strict and does not actually
fully support XML. It requires that all the fields <span class="emphasis"><em>must</em></span>
be present, formatted <span class="emphasis"><em>exactly</em></span> as in this example, and
occur <span class="emphasis"><em>exactly</em></span> in the same order. The only optional
field is <code class="option">timestamp</code>; it defaults to 1.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="xmlpipe2"></a>3.9. xmlpipe2 data source</h3></div></div></div><p>
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx
in yet another custom XML format. It also lets you specify the schema
(ie. the set of fields and attributes) either in the XML stream itself,
or in the source settings.
</p><p>
When indexing an xmlpipe2 source, indexer runs the given command, opens
a pipe to its stdout, and expects a well-formed XML stream. Here's sample
stream data:
</p><div class="example"><a name="ex-xmlpipe2-document"></a><p class="title"><b>Example 3. xmlpipe2 document stream</b></p><div class="example-contents"><pre class="programlisting">
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="subject"/>
<sphinx:field name="content"/>
<sphinx:attr name="published" type="timestamp"/>
<sphinx:attr name="author_id" type="int" bits="16" default="1"/>
</sphinx:schema>
<sphinx:document id="1234">
<content>this is the main content <![CDATA[[and this <cdata> entry must be handled properly by xml parser lib]]></content>
<published>1012325463</published>
<subject>note how field/attr tags can be in <b class="red">randomized</b> order</subject>
<misc>some undeclared element</misc>
</sphinx:document>
<!-- ... more documents here ... -->
<sphinx:killlist>
<id>1234</id>
<id>4567</id>
</sphinx:killlist>
</sphinx:docset>
</pre></div></div><p><br class="example-break">
</p><p>
Arbitrary fields and attributes are allowed.
They can also occur in the stream in arbitrary order within each document; the order is ignored.
There is a restriction on maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).
</p><p>
The schema, ie. the complete list of fields and attributes, must be declared
before any document can be parsed. This can be done either in the
configuration file using <code class="option">xmlpipe_field</code> and <code class="option">xmlpipe_attr_XXX</code>
settings, or right in the stream using <sphinx:schema> element.
<sphinx:schema> is optional. It is only allowed to occur as the very
first sub-element in <sphinx:docset>. If there is no in-stream
schema definition, settings from the configuration file will be used.
Otherwise, stream settings take precedence.
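For instance, the schema from the stream example above could equivalently be declared in the configuration file. A sketch (source name and command are placeholders; the per-attribute bit size from the in-stream declaration is assumed to be declared in the stream, as shown earlier):

```
source example_xmlpipe2_source
{
    type            = xmlpipe2
    xmlpipe_command = cat /path/to/stream.xml

    # same schema as the in-stream <sphinx:schema> example
    xmlpipe_field          = subject
    xmlpipe_field          = content
    xmlpipe_attr_timestamp = published
    xmlpipe_attr_uint      = author_id
}
```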
</p><p>
Unknown tags (that were declared neither as fields nor as attributes)
will be ignored with a warning. In the example above, <misc> will be ignored.
All embedded tags and their attributes (such as <b> in <subject>
in the example above) will be silently ignored.
</p><p>
Support for incoming stream encodings depends on whether <code class="filename">iconv</code>
is installed on the system. xmlpipe2 is parsed using <code class="filename">libexpat</code>
parser that understands US-ASCII, ISO-8859-1, UTF-8 and a few UTF-16 variants
natively. Sphinx <code class="filename">configure</code> script will also check
for <code class="filename">libiconv</code> presence, and utilize it to handle
other encodings. <code class="filename">libexpat</code> also enforces the
requirement to use UTF-8 charset on Sphinx side, because the
parsed data it returns is always in UTF-8.
</p><p>
XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:
</p><div class="variablelist"><dl><dt><span class="term">sphinx:docset</span></dt><dd>Mandatory top-level element, denotes and contains xmlpipe2 document set.</dd><dt><span class="term">sphinx:schema</span></dt><dd>Optional element, must either occur as the very first child
of sphinx:docset, or never occur at all. Declares the document schema.
Contains field and attribute declarations. If present, overrides
per-source settings from the configuration file.
</dd><dt><span class="term">sphinx:field</span></dt><dd>Optional element, child of sphinx:schema. Declares a full-text field.
The only recognized attribute is "name", it specifies the element name
that should be treated as a full-text field in the subsequent documents.
</dd><dt><span class="term">sphinx:attr</span></dt><dd>Optional element, child of sphinx:schema. Declares an attribute.
Known attributes are:
<div class="itemizedlist"><ul type="disc"><li>"name", specifies the element name that should be treated as an attribute in the subsequent documents.</li><li>"type", specifies the attribute type. Possible values are "int", "timestamp", "str2ordinal", "bool", "float" and "multi".</li><li>"bits", specifies the bit size for "int" attribute type. Valid values are 1 to 32.</li><li>"default", specifies the default value for this attribute that should be used if the attribute's element is not present in the document.</li></ul></div></dd><dt><span class="term">sphinx:document</span></dt><dd>Mandatory element, must be a child of sphinx:docset.
Contains arbitrary other elements with field and attribute values
to be indexed, as declared either using sphinx:field and sphinx:attr
elements or in the configuration file. The only known attribute
is "id" that must contain the unique integer document ID.
</dd><dt><span class="term">sphinx:killlist</span></dt><dd>Optional element, child of sphinx:docset.
Contains a number of "id" elements whose contents are document IDs
to be put into a <a href="#conf-sql-query-killlist" title="9.1.15. sql_query_killlist">kill-list</a> for this index.
</dd></dl></div><p>
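Putting these elements together, a generator for this format can be sketched as follows (Python; the element names follow the list above, while the helper function, field names, and document values are hypothetical):

```python
# Sketch: emit a minimal xmlpipe2 document set with an in-stream schema,
# an arbitrary number of documents, and an optional kill-list.
from xml.sax.saxutils import escape

def docset(documents, kill_ids=()):
    out = ['<?xml version="1.0" encoding="utf-8"?>',
           '<sphinx:docset>',
           '<sphinx:schema>',
           '<sphinx:field name="subject"/>',
           '<sphinx:field name="content"/>',
           '<sphinx:attr name="published" type="timestamp"/>',
           '</sphinx:schema>']
    for doc_id, fields in documents:
        out.append('<sphinx:document id="%d">' % doc_id)
        for name, value in fields.items():
            # field/attr elements may appear in any order within a document
            out.append('<%s>%s</%s>' % (name, escape(str(value)), name))
        out.append('</sphinx:document>')
    if kill_ids:
        out.append('<sphinx:killlist>')
        out.extend('<id>%d</id>' % i for i in kill_ids)
        out.append('</sphinx:killlist>')
    out.append('</sphinx:docset>')
    return '\n'.join(out)
```

Note that escape() is what keeps stray &, < and > in the field values from breaking the libexpat parse.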
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="live-updates"></a>3.10. Live index updates</h3></div></div></div><p>
There's a frequent situation when the total dataset is too big
to be reindexed from scratch often, but the amount of new records
is rather small. Example: a forum with 1,000,000 archived posts,
but only 1,000 new posts per day.
</p><p>
In this case, "live" (almost real time) index updates could be
implemented using the so-called "main+delta" scheme.
</p><p>
The idea is to set up two sources and two indexes, with one
"main" index for the data which only changes rarely (if ever),
and one "delta" for the new documents. In the example above,
1,000,000 archived posts would go to the main index, and newly
inserted 1,000 posts/day would go to the delta index. Delta index
could then be reindexed very frequently, and the documents can
be made available to search in a matter of minutes.
</p><p>
Specifying which documents should go to which index, and
reindexing of the main index, can also be made fully automatic.
One option is to create a counter table that tracks the boundary
ID splitting the documents between the two indexes, and to update it
whenever the main index is reindexed.
</p><div class="example"><a name="ex-live-updates"></a><p class="title"><b>Example 4. Fully automated live updates</b></p><div class="example-contents"><pre class="programlisting">
# in MySQL
CREATE TABLE sph_counter
(
counter_id INTEGER PRIMARY KEY NOT NULL,
max_doc_id INTEGER NOT NULL
);
# in sphinx.conf
source main
{
# ...
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
sql_query = SELECT id, title, body FROM documents \
WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, title, body FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
index main
{
source = main
path = /path/to/main
# ... all the other settings
}
# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
source = delta
path = /path/to/delta
}
</pre></div></div><p><br class="example-break">
</p><p>
Note how we're overriding <code class="code">sql_query_pre</code> in the delta source.
That override must be explicit; otherwise, the <code class="code">REPLACE</code> query
would also be run when indexing the delta source, effectively nullifying it. However,
the first time a directive is issued in an inherited source, it removes
<span class="emphasis"><em>all</em></span> inherited values for that directive, so the encoding setup is lost as well.
Therefore, <code class="code">sql_query_pre</code> in the delta source can not simply be left empty;
the encoding setup query needs to be issued explicitly once again.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="index-merging"></a>3.11. Index merging</h3></div></div></div><p>
Merging two existing indexes can be more efficient than indexing the data
from scratch, and is desirable in some cases (such as merging 'main' and 'delta'
indexes instead of simply reindexing 'main' in the 'main+delta' partitioning
scheme). So <code class="filename">indexer</code> has an option to do that.
Merging the indexes is normally faster than reindexing but still
<span class="emphasis"><em>not</em></span> instant on huge indexes. Basically,
it will need to read the contents of both indexes once and write
the result once. Merging 100 GB and 1 GB index, for example,
will result in 202 GB of IO (but that's still likely less than
the indexing from scratch requires).
</p><p>
The basic command syntax is as follows:
</p><pre class="programlisting">
indexer --merge DSTINDEX SRCINDEX [--rotate]
</pre><p>
Only the DSTINDEX index will be affected: the contents of SRCINDEX will be merged into it.
The <code class="option">--rotate</code> switch is required if DSTINDEX is already being served by <code class="filename">searchd</code>.
The initially devised usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX.
Thus, when merging the attributes, values from SRCINDEX will win if duplicate document IDs are encountered.
Note, however, that the "old" keywords will <span class="emphasis"><em>not</em></span> be automatically removed in such cases.
For example, if there's a keyword "old" associated with document 123 in DSTINDEX, and a keyword "new" associated
with it in SRCINDEX, document 123 will be found by <span class="emphasis"><em>both</em></span> keywords after the merge.
You can supply an explicit condition to remove documents from DSTINDEX to mitigate that;
the relevant switch is <code class="option">--merge-dst-range</code>:
</p><pre class="programlisting">
indexer --merge main delta --merge-dst-range deleted 0 0
</pre><p>
This switch lets you apply filters to the destination index along with merging.
There can be several filters; all of their conditions must be met in order
for a document to be included in the resulting merged index. In the example above,
the filter passes only those records where 'deleted' is 0, eliminating all
records that were flagged as deleted (for instance, using the
<a href="#api-func-updateatttributes" title="6.7.2. UpdateAttributes">UpdateAttributes()</a> call).
</p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="searching"></a>4. Searching</h2></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="matching-modes"></a>4.1. Matching modes</h3></div></div></div><p>
There are the following matching modes available:
</p><div class="itemizedlist"><ul type="disc"><li>SPH_MATCH_ALL, matches all query words (default mode);</li><li>SPH_MATCH_ANY, matches any of the query words;</li><li>SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;</li><li>SPH_MATCH_BOOLEAN, matches query as a boolean expression (see <a href="#boolean-syntax" title="4.2. Boolean query syntax">Section 4.2, “Boolean query syntax”</a>);</li><li>SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see <a href="#extended-syntax" title="4.3. Extended query syntax">Section 4.3, “Extended query syntax”</a>). As of 0.9.9, this has been superseded by SPH_MATCH_EXTENDED2, which provides additional functionality and better performance. The ident is retained for legacy application code, which will continue to be compatible once Sphinx and its components, including the API, are upgraded.</li><li>SPH_MATCH_EXTENDED2, matches query using the second version of the Extended matching mode.</li><li>SPH_MATCH_FULLSCAN, forcibly uses the "full scan" mode as described below. Note that any query terms will be ignored in this mode: filters, filter ranges and grouping will still be applied, but no text matching will be performed.</li></ul></div><p>
</p><p>
The SPH_MATCH_FULLSCAN mode will be automatically activated in place of the specified matching mode when the following conditions are met:
</p><div class="orderedlist"><ol type="1"><li>The query string is empty (ie. its length is zero).</li><li><a href="#conf-docinfo" title="9.2.4. docinfo">docinfo</a> storage is set to <code class="code">extern</code>.</li></ol></div><p>
In full scan mode, all the indexed documents will be considered as matching.
Such queries will still apply filters, sorting, and group by, but will not perform any full-text searching.
This can be useful to unify full-text and non-full-text searching code, or to offload SQL server (there are cases when Sphinx scans will perform better than analogous MySQL queries).
An example of using the full scan mode might be to find posts in a forum. By selecting the forum's user ID via <code class="code">SetFilter()</code> but not actually providing any search text, Sphinx will match every document (i.e. every post) where <code class="code">SetFilter()</code> would match - in this case providing every post from that user. By default this will be ordered by relevancy, followed by Sphinx document ID in ascending order (earliest first).
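The activation rule above can be summarized in a few lines (a sketch; the helper function and the string values standing in for the API constants and docinfo setting are hypothetical):

```python
# Sketch of when the full scan mode is substituted for the requested
# matching mode: empty query string plus docinfo=extern.
SPH_MATCH_FULLSCAN = "fullscan"  # stand-in for the real API constant

def effective_match_mode(requested_mode, query, docinfo):
    # Both conditions from the list above must hold.
    if len(query) == 0 and docinfo == "extern":
        return SPH_MATCH_FULLSCAN
    return requested_mode
```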
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="boolean-syntax"></a>4.2. Boolean query syntax</h3></div></div></div><p>
Boolean queries allow the following special operators to be used:
</p><div class="itemizedlist"><ul type="disc"><li>explicit operator AND: <pre class="programlisting">hello & world</pre></li><li>operator OR: <pre class="programlisting">hello | world</pre></li><li>operator NOT:
<pre class="programlisting">
hello -world
hello !world
</pre></li><li>grouping: <pre class="programlisting">( hello world )</pre></li></ul></div><p>
Here's an example query which uses all these operators:
</p><div class="example"><a name="ex-boolean-query"></a><p class="title"><b>Example 5. Boolean query example</b></p><div class="example-contents"><pre class="programlisting">
( cat -dog ) | ( cat -mouse)
</pre></div></div><p><br class="example-break">
</p><p>
There is always an implicit AND operator, so the "hello world" query actually
means "hello & world".
</p><p>
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
means "looking for ( cat | dog | mouse )" and <span class="emphasis"><em>not</em></span>
"(looking for cat) | dog | mouse".
</p><p>
Queries like "-dog", which implicitly include all documents from the
collection, can not be evaluated. This is for both technical and performance
reasons. Technically, Sphinx does not always keep a list of all IDs.
Performance-wise, when the collection is huge (ie. 10-100M documents),
evaluating such queries could take very long.
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="extended-syntax"></a>4.3. Extended query syntax</h3></div></div></div><p>
The following special operators and modifiers can be used when using the extended matching mode:
</p><div class="itemizedlist"><ul type="disc"><li>operator OR: <pre class="programlisting">hello | world</pre></li><li>operator NOT:
<pre class="programlisting">
hello -world
hello !world
</pre></li><li>field search operator: <pre class="programlisting">@title hello @body world</pre></li><li>field position limit modifier (introduced in version 0.9.9-rc1): <pre class="programlisting">@body[50] hello</pre></li><li>multiple-field search operator: <pre class="programlisting">@(title,body) hello world</pre></li><li>all-field search operator: <pre class="programlisting">@* hello</pre></li><li>phrase search operator: <pre class="programlisting">"hello world"</pre></li><li>proximity search operator: <pre class="programlisting">"hello world"~10</pre></li><li>quorum matching operator: <pre class="programlisting">"the world is a wonderful place"/3</pre></li><li>strict order operator (aka operator "before"): <pre class="programlisting">aaa << bbb << ccc</pre></li><li>exact form modifier (introduced in version 0.9.9-rc1): <pre class="programlisting">raining =cats and =dogs</pre></li><li>field-start and field-end modifier (introduced in version 0.9.9-rc2): <pre class="programlisting">^hello world$</pre></li></ul></div><p>
Here's an example query that uses some of these operators:
</p><div class="example"><a name="ex-extended-query"></a><p class="title"><b>Example 6. Extended matching mode: query example</b></p><div class="example-contents"><pre class="programlisting">
"hello world" @title "example program"~5 @body python -(php|perl) @* code
</pre></div></div><p><br class="example-break">
The full meaning of this search is:
</p><div class="itemizedlist"><ul type="disc"><li>Find the words 'hello' and 'world' adjacently in any field in a document;</li><li>Additionally, the same document must also contain the words 'example' and 'program' in the title field, with up to, but not including, 5 words between the words in question; (E.g. "example PHP program" would be matched, however "example script to introduce outside data into the correct context for your program" would not, because the two terms have 5 or more words between them)</li><li>Additionally, the same document must contain the word 'python' in the body field, but not contain either 'php' or 'perl';</li><li>Additionally, the same document must contain the word 'code' in any field.</li></ul></div><p>
</p><p>
There is always an implicit AND operator, so "hello world" means that
both "hello" and "world" must be present in a matching document.
</p><p>
OR operator precedence is higher than AND, so "looking for cat | dog | mouse"
means "looking for ( cat | dog | mouse )" and <span class="emphasis"><em>not</em></span>
"(looking for cat) | dog | mouse".
</p><p>
Field limit operator limits subsequent searching to a given field.
Normally, query will fail with an error message if given field name does not exist
in the searched index. However, that can be suppressed by specifying "@@relaxed"
option at the very beginning of the query:
</p><pre class="programlisting">
@@relaxed @nosuchfield my query
</pre><p>
This can be helpful when searching through heterogeneous indexes with
different schemas.
</p><p>
Field position limit, introduced in version 0.9.9-rc1, additionally restricts the searching
to the first N positions within a given field (or fields). For example, "@body[50] hello" will
<span class="bold"><strong>not</strong></span> match the documents where the keyword 'hello' occurs at position 51 or later
in the body.
</p><p>
Proximity distance is specified in words, adjusted for word count, and
applies to all words within quotes. For instance, the "cat dog mouse"~5 query
means that there must be a span of less than 8 words which contains all 3 words,
ie. a "CAT aaa bbb ccc DOG eee fff MOUSE" document will <span class="emphasis"><em>not</em></span>
match this query, because this span is exactly 8 words long.
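That rule (all N keywords must fall within a span of fewer than N plus distance words) can be sketched like this; the helper is hypothetical, and whitespace tokenization with case-insensitive matching is assumed:

```python
# Sketch: evaluate the proximity operator "w1 w2 ... wN"~dist against
# a document, using the "span shorter than N + dist words" rule.
def proximity_match(doc, keywords, distance):
    words = doc.lower().split()
    kws = set(k.lower() for k in keywords)
    # Largest allowed span length: strictly less than N + distance.
    window = len(kws) + distance - 1
    if window <= 0:
        return False
    last_start = max(len(words) - window, 0)
    for start in range(last_start + 1):
        if kws.issubset(words[start:start + window]):
            return True
    return False
```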
</p><p>
Quorum matching operator introduces a kind of fuzzy matching.
It will only match those documents that pass a given threshold of given words.
The example above ("the world is a wonderful place"/3) will match all documents
that have at least 3 of the 6 specified words.
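The quorum threshold check is simple to state in code (a sketch with a hypothetical helper; whitespace tokenization and distinct-word counting are assumptions):

```python
# Sketch: quorum operator "w1 ... wN"/k matches a document containing
# at least k distinct words out of the N quoted ones.
def quorum_match(doc, keywords, threshold):
    words = set(doc.lower().split())
    hits = sum(1 for k in set(keywords) if k.lower() in words)
    return hits >= threshold
```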
</p><p>
Strict order operator (aka operator "before"), introduced in version 0.9.9-rc2,
will match the document only if its argument keywords occur in the document
exactly in the query order. For instance, "black << cat" query (without
quotes) will match the document "black and white cat" but <span class="emphasis"><em>not</em></span>
the "that cat was black" document. Order operator has the lowest priority.
It can be applied both to just keywords and more complex expressions,
ie. this is a valid query:
</p><pre class="programlisting">
(bag of words) << "exact phrase" << red|green|blue
</pre><p>
</p><p>
Exact form keyword modifier, introduced in version 0.9.9-rc1, will match the document only if the keyword occurred
in exactly the specified form. The default behaviour is to match the document
if the stemmed keyword matches. For instance, "runs" query will match both
the document that contains "runs" <span class="emphasis"><em>and</em></span> the document that
contains "running", because both forms stem to just "run" - while "=runs"
query will only match the first document. Exact form operator requires
<a href="#conf-index-exact-words" title="9.2.39. index_exact_words">index_exact_words</a> option to be enabled.
This is a modifier that affects the keyword and thus can be used within
operators such as phrase, proximity, and quorum operators.
</p><p>
Field-start and field-end keyword modifiers, introduced in version 0.9.9-rc2,
will make the keyword match only if it occurred at the very start or the very end
of a fulltext field, respectively. For instance, the query "^hello world$"
(with quotes and thus combining phrase operator and start/end modifiers)
will only match documents that contain at least one field that has exactly
these two keywords.
</p><p>
Starting with 0.9.9-rc1, arbitrarily nested brackets and negations are allowed.
However, the query must be possible to compute without involving an implicit
list of all documents:
</p><pre class="programlisting">
// correct query
aaa -(bbb -(ccc ddd))
// queries that are non-computable
-aaa
aaa | -bbb
</pre><p>
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="weighting"></a>4.4. Weighting</h3></div></div></div><p>
Specific weighting function (currently) depends on the search mode.
</p><p>
There are these major parts which are used in the weighting functions:
</p><div class="orderedlist"><ol type="1"><li>phrase rank,</li><li>statistical rank.</li></ol></div><p>
</p><p>
Phrase rank is based on a length of longest common subsequence
(LCS) of search words between document body and query phrase. So if
there's a perfect phrase match in some document then its phrase rank
would be the highest possible, and equal to query words count.
</p><p>
Statistical rank is based on classic BM25 function which only takes
word frequencies into account. If the word is rare in the whole database
(ie. low frequency over document collection) or mentioned a lot in specific
document (ie. high frequency over matching document), it receives more weight.
Final BM25 weight is a floating point number between 0 and 1.
</p><p>
In all modes, per-field weighted phrase ranks are computed as
a product of the LCS multiplied by a user-specified per-field weight.
Per-field weights are integer, default to 1, and can not be set
lower than 1.
</p><p>
In SPH_MATCH_BOOLEAN mode, no weighting is performed at all, every match weight
is set to 1.
</p><p>
In SPH_MATCH_ALL and SPH_MATCH_PHRASE modes, final weight is a sum of weighted phrase ranks.
</p><p>
In SPH_MATCH_ANY mode, the idea is essentially the same, but it also
adds a count of matching words in each field. Before that, weighted
phrase ranks are additionally multiplied by a value big enough to
guarantee that a higher phrase rank in <span class="bold"><strong>any</strong></span> field will make the
match ranked higher, even if its field weight is low.
</p><p>
In SPH_MATCH_EXTENDED mode, final weight is a sum of weighted phrase
ranks and BM25 weight, multiplied by 1000 and rounded to integer.
</p><p>
This is going to be changed, so that MATCH_ALL and MATCH_ANY modes
use BM25 weights as well. This would improve search results in those
match spans where phrase ranks are equal; this is especially useful
for 1-word queries.
</p><p>
The key idea (in all modes, besides boolean) is that better subphrase
matches are ranked higher, and perfect matches are pulled to the top. Author's
experience is that this phrase proximity based ranking provides noticeably
better search quality than any statistical scheme alone (such as BM25,
which is commonly used in other search engines).
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="sorting-modes"></a>4.5. Sorting modes</h3></div></div></div><p>
There are the following result sorting modes available:
</p><div class="itemizedlist"><ul type="disc"><li>SPH_SORT_RELEVANCE mode, that sorts by relevance in descending order (best matches first);</li><li>SPH_SORT_ATTR_DESC mode, that sorts by an attribute in descending order (bigger attribute values first);</li><li>SPH_SORT_ATTR_ASC mode, that sorts by an attribute in ascending order (smaller attribute values first);</li><li>SPH_SORT_TIME_SEGMENTS mode, that sorts by time segments (last hour/day/week/month) in descending order, and then by relevance in descending order;</li><li>SPH_SORT_EXTENDED mode, that sorts by SQL-like combination of columns in ASC/DESC order;</li><li>SPH_SORT_EXPR mode, that sorts by an arithmetic expression.</li></ul></div><p>
</p><p>
SPH_SORT_RELEVANCE ignores any additional parameters and always sorts matches
by relevance rank. All other modes require an additional sorting clause, with the
syntax depending on specific mode. SPH_SORT_ATTR_ASC, SPH_SORT_ATTR_DESC and
SPH_SORT_TIME_SEGMENTS modes require simply an attribute name.
SPH_SORT_RELEVANCE is equivalent to sorting by "@weight DESC, @id ASC" in extended sorting mode,
SPH_SORT_ATTR_ASC is equivalent to "attribute ASC, @weight DESC, @id ASC",
and SPH_SORT_ATTR_DESC to "attribute DESC, @weight DESC, @id ASC" respectively.
</p><h4><a name="id363038"></a>SPH_SORT_TIME_SEGMENTS mode</h4><p>
In SPH_SORT_TIME_SEGMENTS mode, attribute values are split into so-called
time segments, and then sorted by time segment first, and by relevance second.
</p><p>
The segments are calculated according to the <span class="emphasis"><em>current timestamp</em></span>
at the time when the search is performed, so the results would change over time.
The segments are as follows:
</p><div class="itemizedlist"><ul type="disc"><li>last hour,</li><li>last day,</li><li>last week,</li><li>last month,</li><li>last 3 months,</li><li>everything else.</li></ul></div><p>
These segments are hardcoded, but it is trivial to change them if necessary.
</p><p>
This mode was added to support searching through blogs, news headlines, etc.
When using time segments, recent records would be ranked higher because of the segment,
but within the same segment, more relevant records would be ranked higher,
unlike sorting by just the timestamp attribute, which would not take relevance
into account at all.
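The segment computation itself can be sketched as follows (hypothetical helper; the exact boundaries used for "month" and "3 months", here 30 and 90 days, are assumptions):

```python
# Sketch: map an attribute timestamp to one of the hardcoded time
# segments, relative to the time the search is performed.
# Smaller segment number means more recent; results sort by segment
# first, then by relevance.
def time_segment(attr_ts, now):
    age = now - attr_ts                  # seconds since the record
    if age <= 3600:        return 0      # last hour
    if age <= 86400:       return 1      # last day
    if age <= 7 * 86400:   return 2      # last week
    if age <= 30 * 86400:  return 3      # last month (30 days assumed)
    if age <= 90 * 86400:  return 4      # last 3 months (90 days assumed)
    return 5                             # everything else
```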
</p><h4><a name="sort-extended"></a>SPH_SORT_EXTENDED mode</h4><p>
In SPH_SORT_EXTENDED mode, you can specify an SQL-like sort expression
with up to 5 attributes (including internal attributes), eg:
</p><pre class="programlisting">
@relevance DESC, price ASC, @id DESC
</pre><p>
</p><p>
Both internal attributes (that are computed by the engine on the fly)
and user attributes that were configured for this index are allowed.
Internal attribute names must start with magic @-symbol; user attribute
names can be used as is. In the example above, <code class="option">@relevance</code>
and <code class="option">@id</code> are internal attributes and <code class="option">price</code> is user-specified.
</p><p>
Known internal attributes are:
</p><div class="itemizedlist"><ul type="disc"><li>@id (match ID)</li><li>@weight (match weight)</li><li>@rank (match weight)</li><li>@relevance (match weight)</li><li>@random (return results in random order)</li></ul></div><p>
<code class="option">@rank</code> and <code class="option">@relevance</code> are just additional
aliases to <code class="option">@weight</code>.
</p><h4><a name="sort-expr"></a>SPH_SORT_EXPR mode</h4><p>
Expression sorting mode lets you sort the matches by an arbitrary arithmetic
expression, involving attribute values, internal attributes (@id and @weight),
arithmetic operations, and a number of built-in functions. Here's an example:
</p><pre class="programlisting">
$cl->SetSortMode ( SPH_SORT_EXPR,
"@weight + ( user_karma + ln(pageviews) )*0.1" );
</pre><p>
</p><p>
The following operators and functions are supported. They are modeled after MySQL.
The functions take a number of arguments depending on the specific function.
</p><div class="itemizedlist"><ul type="disc"><li>Operators: +, -, *, /, <, >, <=, >=, =, <>.</li><li>Boolean operators: AND, OR, NOT.</li><li>0-argument functions: NOW().</li><li>Unary (1-argument) functions: ABS(), CEIL(), FLOOR(), SIN(), COS(), LN(), LOG2(), LOG10(), EXP(), SQRT(), BIGINT().</li><li>Binary (2-argument) functions: MIN(), MAX(), POW(), IDIV().</li><li>Other functions: IF(), INTERVAL(), IN(), GEODIST().</li></ul></div><p>
</p><p>
Calculations can be performed in three different modes: (a) using single-precision,
32-bit IEEE 754 floating point values (the default), (b) using signed 32-bit integers,
(c) using 64-bit signed integers. The expression parser will automatically switch
to integer mode if there are no operations that result in a floating point value.
Otherwise, it will use the default floating point mode. For instance, "a+b"
will be computed using 32-bit integers if both arguments are 32-bit integers;
or using 64-bit integers if both arguments are integers but one of them is
64-bit; or in floats otherwise. However, "a/b" or "sqrt(a)" will always be
computed in floats, because these operations return a non-integer result.
To avoid that, you can use IDIV(). Also, "a*b" will not be automatically
promoted to 64-bit when the arguments are 32-bit. To enforce 64-bit results,
you can use BIGINT(). (But note that if there are non-integer operations,
BIGINT() will simply be ignored.)
</p><p>
Comparison operators (eg. = or <=) return 1.0 when the condition is true and 0.0 otherwise.
For instance, <code class="code">(a=b)+3</code> will evaluate to 4 when attribute 'a' is equal to attribute 'b', and to 3 when 'a' is not.
Unlike MySQL, the equality comparisons (ie. = and <> operators) introduce a small equality threshold (1e-6 by default).
If the difference between compared values is within the threshold, they will be considered equal.
</p><p>
Boolean operators (AND, OR, NOT) were introduced in 0.9.9-rc2 and behave as usual.
They are left-associative and have the lowest priority compared to other operators.
NOT has higher priority than AND and OR, but still lower than any other operator.
AND and OR have the same priority, so the use of brackets is recommended to avoid
confusion in complex expressions.
</p><p>
All unary and binary functions are straightforward, they behave just like their mathematical counterparts.
But <code class="code">IF()</code> behavior needs to be explained in more detail.
It takes 3 arguments, checks whether the 1st argument is equal to 0.0, and returns the 2nd argument if it is not zero, or the 3rd one when it is.
Note that unlike comparison operators, <code class="code">IF()</code> does <span class="bold"><strong>not</strong></span> use a threshold!
Therefore, it's safe to use comparison results as its 1st argument, but arithmetic operators might produce unexpected results.
For instance, the following two calls will produce <span class="emphasis"><em>different</em></span> results even though they are logically equivalent:
</p><pre class="programlisting">
IF ( sqrt(3)*sqrt(3)-3<>0, a, b )
IF ( sqrt(3)*sqrt(3)-3, a, b )
</pre><p>
In the first case, the comparison operator <> will return 0.0 (false)
because of a threshold, and <code class="code">IF()</code> will always return 'b' as a result.
In the second one, the same <code class="code">sqrt(3)*sqrt(3)-3</code> expression will be compared
with zero <span class="emphasis"><em>without</em></span> threshold by the <code class="code">IF()</code> function itself.
But its value will be slightly different from zero because of limited floating point
calculations precision. Because of that, the comparison with 0.0 done by <code class="code">IF()</code>
will not pass, and the second variant will return 'a' as a result.
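The same effect can be reproduced with ordinary double-precision floats (a sketch; the EPS value mirrors the default 1e-6 threshold mentioned above, and the variable names are illustrative):

```python
# Sketch: thresholded equality (as in the <> operator) versus the
# exact zero test performed by IF().
import math

EPS = 1e-6  # default equality threshold

x = math.sqrt(3) * math.sqrt(3) - 3   # tiny nonzero value, roughly -4.4e-16

# Thresholded comparison: x is within EPS of zero, so "<> 0" is false.
neq_with_threshold = abs(x - 0.0) > EPS
# IF() compares its 1st argument against 0.0 exactly, so it sees nonzero.
if_sees_nonzero = (x != 0.0)
```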
</p><p>
BIGINT() function, introduced in version 0.9.9-rc1, forcibly promotes an integer argument to 64-bit type,
and does nothing with a floating point argument. It's intended to help enforce evaluation
of certain expressions (such as "a*b") in 64-bit mode even though all the arguments
are 32-bit.
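To illustrate why this matters (a Python sketch, not Sphinx internals), here is what 32-bit wraparound does to a product that needs 64 bits; mul32 is a hypothetical helper simulating 32-bit signed arithmetic:

```python
# Simulate 32-bit signed multiplication to show the overflow that
# evaluating "a*b" in 64-bit mode (e.g. via BIGINT()) avoids.
def mul32(a, b):
    r = (a * b) & 0xFFFFFFFF
    return r - 0x100000000 if r >= 0x80000000 else r

a, b = 100000, 100000
print(mul32(a, b))  # 1410065408 -- wrapped 32-bit result
print(a * b)        # 10000000000 -- the correct 64-bit result
```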
</p><p>
IDIV() function performs integer division on its 2 arguments. The result
is an integer as well, unlike the result of "a/b".
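A sketch of the distinction in Python (idiv is a hypothetical stand-in; whether Sphinx truncates toward zero for negative operands is our assumption here, modeled on C semantics):

```python
# C-style integer division: truncates toward zero (an assumption for negatives).
def idiv(a, b):
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

print(idiv(7, 2))   # 3 -- integer result
print(idiv(-7, 2))  # -3 -- truncated toward zero (Python's own -7 // 2 is -4)
print(7 / 2)        # 3.5 -- plain "a/b" keeps the fractional part
```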
</p><p>
IN(expr,val1,val2,...), introduced in version 0.9.9-rc1, takes 2 or more arguments, and returns 1 if 1st argument
(expr) is equal to any of the other arguments (val1..valN), or 0 otherwise.
Currently, all the checked values (but not the expression itself!) are required
to be constant. (It's technically possible to implement arbitrary expressions too,
and that might be implemented in the future.) Constants are pre-sorted and then
binary search is used, so IN() will be very quick even against a big arbitrary
list of constants. Starting with 0.9.9-rc2, the first argument can also be
an MVA attribute. In that case, IN() will return 1 if any of the MVA values
is equal to any of the other arguments.
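The pre-sort-then-binary-search approach described above can be sketched in Python (make_in and the MVA handling are illustrative, not the actual implementation):

```python
import bisect

# Pre-sort the constant list once, then binary-search each evaluated value.
def make_in(constants):
    sorted_vals = sorted(constants)
    def contains(value):
        i = bisect.bisect_left(sorted_vals, value)
        return 1 if i < len(sorted_vals) and sorted_vals[i] == value else 0
    return contains

in_list = make_in([30, 10, 20])
print(in_list(20))  # 1
print(in_list(25))  # 0

# MVA-style first argument: 1 if any of the attribute's values matches.
mva_values = [5, 20, 40]
print(1 if any(in_list(v) for v in mva_values) else 0)  # 1
```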
</p><p>
INTERVAL(expr,point1,point2,point3,...), introduced in version 0.9.9-rc1, takes 2 or more arguments, and returns
the index of the interval that the first argument falls into: it returns
0 if expr&lt;point1, 1 if point1&lt;=expr&lt;point2, and so on.
It is required that point1&lt;point2&lt;...&lt;pointN for this function
to work correctly.
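With the points sorted ascending, the result is simply the number of points that are less than or equal to expr, which a binary search finds directly (a Python sketch, not the actual implementation):

```python
import bisect

# INTERVAL(expr, point1, ..., pointN) as a binary search over sorted points.
def interval(expr, *points):
    return bisect.bisect_right(points, expr)

print(interval(5, 10, 20))   # 0 -- expr < point1
print(interval(10, 10, 20))  # 1 -- point1 <= expr < point2
print(interval(25, 10, 20))  # 2
```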
</p><p>
NOW(), introduced in version 0.9.9-rc1, is a helper function that returns current timestamp as a 32-bit integer.
</p><p>
GEODIST(lat1,long1,lat2,long2) function, introduced in version 0.9.9-rc2,
computes the geosphere distance between two points specified by their
coordinates. Note that both latitudes and longitudes must be in radians,
and the result will be in meters. You can use an arbitrary expression as any
of the four coordinates. An optimized path will be selected when one pair
of the arguments refers directly to a pair of attributes and the other one
is constant.
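The kind of computation involved can be sketched with the standard haversine formula (an illustration; the exact formula and Earth radius constant that Sphinx uses internally may differ):

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; the constant Sphinx uses may differ

# Haversine great-circle distance; inputs in radians, result in meters.
def geodist(lat1, long1, lat2, long2):
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((long2 - long1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Example: two city coordinates, converted from degrees to radians first.
msk = (math.radians(55.7558), math.radians(37.6173))
spb = (math.radians(59.9311), math.radians(30.3609))
print(round(geodist(msk[0], msk[1], spb[0], spb[1]) / 1000), "km")  # roughly 630 km
```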
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="clustering"></a>4.6. Grouping (clustering) search results </h3></div></div></div><p>
Sometimes it could be useful to group (or, in other terms, cluster)
search results and/or count per-group match counts - for instance,
to draw a nice graph of how many matching blog posts there were per
month; or to group Web search results by site; or to group
matching forum posts by author; etc.
</p><p>
In theory, this could be performed by doing only the full-text search
in Sphinx and then using found IDs to group on SQL server side. However,
in practice doing this with a big result set (10K-10M matches) would
typically kill performance.
</p><p>
To avoid that, Sphinx offers so-called grouping mode. It is enabled
with SetGroupBy() API call. When grouping, all matches are assigned to
different groups based on group-by value. This value is computed from
specified attribute using one of the following built-in functions:
</p><div class="itemizedlist"><ul type="disc"><li>SPH_GROUPBY_DAY, extracts year, month and day in YYYYMMDD format from timestamp;</li><li>SPH_GROUPBY_WEEK, extracts year and first day of the week number (counting from year start) in YYYYNNN format from timestamp;</li><li>SPH_GROUPBY_MONTH, extracts month in YYYYMM format from timestamp;</li><li>SPH_GROUPBY_YEAR, extracts year in YYYY format from timestamp;</li><li>SPH_GROUPBY_ATTR, uses attribute value itself for grouping.</li></ul></div><p>
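As an illustration of the key formats (not Sphinx code; UTC is assumed here, and SPH_GROUPBY_WEEK is omitted since its exact week-numbering rules are not spelled out above):

```python
import time

# Compute group-by keys from a Unix timestamp, mirroring the formats above.
def group_key(ts, func):
    t = time.gmtime(ts)  # assuming UTC; actual time zone handling may differ
    if func == 'SPH_GROUPBY_DAY':
        return t.tm_year * 10000 + t.tm_mon * 100 + t.tm_mday  # YYYYMMDD
    if func == 'SPH_GROUPBY_MONTH':
        return t.tm_year * 100 + t.tm_mon                      # YYYYMM
    if func == 'SPH_GROUPBY_YEAR':
        return t.tm_year                                       # YYYY
    return ts  # SPH_GROUPBY_ATTR: the attribute value itself

ts = 1199145600  # 2008-01-01 00:00:00 UTC
print(group_key(ts, 'SPH_GROUPBY_DAY'))    # 20080101
print(group_key(ts, 'SPH_GROUPBY_MONTH'))  # 200801
```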
</p><p>
The final search result set then contains one best match per group.
The grouping function value and the per-group match count are returned
along with the matches as "virtual" attributes named
<span class="bold"><strong>@group</strong></span> and
<span class="bold"><strong>@count</strong></span> respectively.
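A minimal Python sketch of this behavior (the match data and attribute names are made up; the real work happens inside searchd):

```python
# Group matches by an attribute, keep the best-weight match per group,
# and attach @group / @count style virtual attributes.
matches = [
    {'id': 1, 'weight': 10, 'site': 'a.com'},
    {'id': 2, 'weight': 30, 'site': 'a.com'},
    {'id': 3, 'weight': 20, 'site': 'b.com'},
]

groups = {}
for m in matches:
    key = m['site']  # SPH_GROUPBY_ATTR over the 'site' attribute
    best, count = groups.get(key, (None, 0))
    if best is None or m['weight'] > best['weight']:
        best = m
    groups[key] = (best, count + 1)

result = [dict(best, **{'@group': k, '@count': n}) for k, (best, n) in groups.items()]
for row in sorted(result, key=lambda r: r['@group']):
    print(row['id'], row['@group'], row['@count'])
# 2 a.com 2
# 3 b.com 1
```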
</p><p>
The result set is sorted by the group-by sorting clause, with syntax similar
to <a href="#sort-extended"><code class="option">SPH_SORT_EXTENDED</code> sorting clause</a>
syntax. In addition to <code class="option">@id</code> and <code class="option">@weight</code>,
group-by sorting clause may also include:
</p><div class="itemizedlist"><ul type="disc"><li>@group (groupby function value),</li><li>@count (amount of matches in group).</li></ul></div><p>
</p><p>
The default mode is to sort by groupby value in descending order,