Sphinx 1.11-beta reference manual
=================================
Free open-source SQL full-text search engine
============================================
Copyright (c) 2001-2010 Andrew Aksyonoff
Copyright (c) 2008-2010 Sphinx Technologies Inc, http://sphinxsearch.com
----------------------------------------------------------------------------
Table of Contents
1. Introduction
1.1. About
1.2. Sphinx features
1.3. Where to get Sphinx
1.4. License
1.5. Credits
1.6. History
2. Installation
2.1. Supported systems
2.2. Required tools
2.3. Installing Sphinx on Linux
2.4. Installing Sphinx on Windows
2.5. Known installation issues
2.6. Quick Sphinx usage tour
3. Indexing
3.1. Data sources
3.2. Attributes
3.3. MVA (multi-valued attributes)
3.4. Indexes
3.5. Restrictions on the source data
3.6. Charsets, case folding, and translation tables
3.7. SQL data sources (MySQL, PostgreSQL)
3.8. xmlpipe data source
3.9. xmlpipe2 data source
3.10. Live index updates
3.11. Delta index updates
3.12. Index merging
4. Real-time indexes
4.1. RT indexes overview
4.2. Known caveats with RT indexes
4.3. RT index internals
4.4. Binary logging
5. Searching
5.1. Matching modes
5.2. Boolean query syntax
5.3. Extended query syntax
5.4. Weighting
5.5. Sorting modes
5.6. Grouping (clustering) search results
5.7. Distributed searching
5.8. searchd query log format
5.9. MySQL protocol support and SphinxQL
5.10. Multi-queries
6. Command line tools reference
6.1. indexer command reference
6.2. searchd command reference
6.3. search command reference
6.4. spelldump command reference
6.5. indextool command reference
7. SphinxQL reference
7.1. SELECT syntax
7.2. SHOW META syntax
7.3. SHOW WARNINGS syntax
7.4. SHOW STATUS syntax
7.5. INSERT and REPLACE syntax
7.6. DELETE syntax
7.7. SET syntax
7.8. BEGIN, COMMIT, and ROLLBACK syntax
7.9. CALL SNIPPETS syntax
7.10. CALL KEYWORDS syntax
8. API reference
8.1. General API functions
8.1.1. GetLastError
8.1.2. GetLastWarning
8.1.3. SetServer
8.1.4. SetRetries
8.1.5. SetConnectTimeout
8.1.6. SetArrayResult
8.1.7. IsConnectError
8.2. General query settings
8.2.1. SetLimits
8.2.2. SetMaxQueryTime
8.2.3. SetOverride
8.2.4. SetSelect
8.3. Full-text search query settings
8.3.1. SetMatchMode
8.3.2. SetRankingMode
8.3.3. SetSortMode
8.3.4. SetWeights
8.3.5. SetFieldWeights
8.3.6. SetIndexWeights
8.4. Result set filtering settings
8.4.1. SetIDRange
8.4.2. SetFilter
8.4.3. SetFilterRange
8.4.4. SetFilterFloatRange
8.4.5. SetGeoAnchor
8.5. GROUP BY settings
8.5.1. SetGroupBy
8.5.2. SetGroupDistinct
8.6. Querying
8.6.1. Query
8.6.2. AddQuery
8.6.3. RunQueries
8.6.4. ResetFilters
8.6.5. ResetGroupBy
8.7. Additional functionality
8.7.1. BuildExcerpts
8.7.2. UpdateAttributes
8.7.3. BuildKeywords
8.7.4. EscapeString
8.7.5. Status
8.7.6. FlushAttributes
8.8. Persistent connections
8.8.1. Open
8.8.2. Close
9. MySQL storage engine (SphinxSE)
9.1. SphinxSE overview
9.2. Installing SphinxSE
9.2.1. Compiling MySQL 5.0.x with SphinxSE
9.2.2. Compiling MySQL 5.1.x with SphinxSE
9.2.3. Checking SphinxSE installation
9.3. Using SphinxSE
9.4. Building snippets (excerpts) via MySQL
10. Reporting bugs
11. sphinx.conf options reference
11.1. Data source configuration options
11.1.1. type
11.1.2. sql_host
11.1.3. sql_port
11.1.4. sql_user
11.1.5. sql_pass
11.1.6. sql_db
11.1.7. sql_sock
11.1.8. mysql_connect_flags
11.1.9. mysql_ssl_cert, mysql_ssl_key, mysql_ssl_ca
11.1.10. odbc_dsn
11.1.11. sql_query_pre
11.1.12. sql_query
11.1.13. sql_joined_field
11.1.14. sql_query_range
11.1.15. sql_range_step
11.1.16. sql_query_killlist
11.1.17. sql_attr_uint
11.1.18. sql_attr_bool
11.1.19. sql_attr_bigint
11.1.20. sql_attr_timestamp
11.1.21. sql_attr_str2ordinal
11.1.22. sql_attr_float
11.1.23. sql_attr_multi
11.1.24. sql_attr_string
11.1.25. sql_attr_str2wordcount
11.1.26. sql_field_string
11.1.27. sql_field_str2wordcount
11.1.28. sql_file_field
11.1.29. sql_query_post
11.1.30. sql_query_post_index
11.1.31. sql_ranged_throttle
11.1.32. sql_query_info
11.1.33. xmlpipe_command
11.1.34. xmlpipe_field
11.1.35. xmlpipe_field_string
11.1.36. xmlpipe_field_wordcount
11.1.37. xmlpipe_attr_uint
11.1.38. xmlpipe_attr_bool
11.1.39. xmlpipe_attr_timestamp
11.1.40. xmlpipe_attr_str2ordinal
11.1.41. xmlpipe_attr_float
11.1.42. xmlpipe_attr_multi
11.1.43. xmlpipe_attr_string
11.1.44. xmlpipe_fixup_utf8
11.1.45. mssql_winauth
11.1.46. mssql_unicode
11.1.47. unpack_zlib
11.1.48. unpack_mysqlcompress
11.1.49. unpack_mysqlcompress_maxsize
11.2. Index configuration options
11.2.1. type
11.2.2. source
11.2.3. path
11.2.4. docinfo
11.2.5. mlock
11.2.6. morphology
11.2.7. min_stemming_len
11.2.8. stopwords
11.2.9. wordforms
11.2.10. exceptions
11.2.11. min_word_len
11.2.12. charset_type
11.2.13. charset_table
11.2.14. ignore_chars
11.2.15. min_prefix_len
11.2.16. min_infix_len
11.2.17. prefix_fields
11.2.18. infix_fields
11.2.19. enable_star
11.2.20. ngram_len
11.2.21. ngram_chars
11.2.22. phrase_boundary
11.2.23. phrase_boundary_step
11.2.24. html_strip
11.2.25. html_index_attrs
11.2.26. html_remove_elements
11.2.27. local
11.2.28. agent
11.2.29. agent_blackhole
11.2.30. agent_connect_timeout
11.2.31. agent_query_timeout
11.2.32. preopen
11.2.33. ondisk_dict
11.2.34. inplace_enable
11.2.35. inplace_hit_gap
11.2.36. inplace_docinfo_gap
11.2.37. inplace_reloc_factor
11.2.38. inplace_write_factor
11.2.39. index_exact_words
11.2.40. overshort_step
11.2.41. stopword_step
11.2.42. hitless_words
11.2.43. expand_keywords
11.2.44. blend_chars
11.2.45. rt_mem_limit
11.2.46. rt_field
11.2.47. rt_attr_uint
11.2.48. rt_attr_bigint
11.2.49. rt_attr_float
11.2.50. rt_attr_timestamp
11.2.51. rt_attr_string
11.3. indexer program configuration options
11.3.1. mem_limit
11.3.2. max_iops
11.3.3. max_iosize
11.3.4. max_xmlpipe2_field
11.3.5. write_buffer
11.3.6. max_file_field_buffer
11.4. searchd program configuration options
11.4.1. listen
11.4.2. address
11.4.3. port
11.4.4. log
11.4.5. query_log
11.4.6. read_timeout
11.4.7. client_timeout
11.4.8. max_children
11.4.9. pid_file
11.4.10. max_matches
11.4.11. seamless_rotate
11.4.12. preopen_indexes
11.4.13. unlink_old
11.4.14. attr_flush_period
11.4.15. ondisk_dict_default
11.4.16. max_packet_size
11.4.17. mva_updates_pool
11.4.18. crash_log_path
11.4.19. max_filters
11.4.20. max_filter_values
11.4.21. listen_backlog
11.4.22. read_buffer
11.4.23. read_unhinted
11.4.24. max_batch_queries
11.4.25. subtree_docs_cache
11.4.26. subtree_hits_cache
11.4.27. workers
11.4.28. dist_threads
11.4.29. binlog_path
11.4.30. binlog_flush
11.4.31. binlog_max_log_size
A. Sphinx revision history
A.1. Version 1.10-beta, 19 jul 2010
A.2. Version 0.9.9-release, 02 dec 2009
A.3. Version 0.9.9-rc2, 08 apr 2009
A.4. Version 0.9.9-rc1, 17 nov 2008
A.5. Version 0.9.8.1, 30 oct 2008
A.6. Version 0.9.8, 14 jul 2008
A.7. Version 0.9.7, 02 apr 2007
A.8. Version 0.9.7-rc2, 15 dec 2006
A.9. Version 0.9.7-rc1, 26 oct 2006
A.10. Version 0.9.6, 24 jul 2006
A.11. Version 0.9.6-rc1, 26 jun 2006
List of Examples
3.1. Ranged query usage example
3.2. XMLpipe document stream
3.3. xmlpipe2 document stream
3.4. Fully automated live updates
4.1. RT index declaration
5.1. Boolean query example
5.2. Extended matching mode: query example
Chapter 1. Introduction
=======================
Table of Contents
1.1. About
1.2. Sphinx features
1.3. Where to get Sphinx
1.4. License
1.5. Credits
1.6. History
1.1. About
==========
Sphinx is a full-text search engine, publicly distributed under GPL version
2. Commercial licensing (eg. for embedded use) is available upon request.
Technically, Sphinx is a standalone software package that provides fast and
relevant full-text search functionality to client applications. It was
specially designed to integrate well with SQL databases storing the data,
and to be easily accessed by scripting languages. However, Sphinx does not
depend on nor require any specific database to function.
Applications can access the Sphinx search daemon (searchd) using any of
three different access methods: a) via the native search API (SphinxAPI), b)
via Sphinx's own implementation of the MySQL network protocol (using a small
SQL subset called SphinxQL), or c) via MySQL server with a pluggable storage
engine (SphinxSE).
Official native SphinxAPI implementations for PHP, Perl, Ruby, and Java are
included within the distribution package. API is very lightweight so
porting it to a new language is known to take a few hours or days. Third
party API ports and plugins exist for Perl, C#, Haskell, Ruby-on-Rails, and
possibly other languages and frameworks.
Starting with version 1.10-beta, Sphinx supports two different indexing
backends: "disk" index backend, and "realtime" (RT) index backend. Disk
indexes support online full-text index rebuilds, but online updates can
only be done on non-text (attribute) data. RT indexes additionally allow
for online full-text index updates. Previous versions only supported disk
indexes.
Data can be loaded into disk indexes using a so-called data source.
Built-in sources can fetch data directly from MySQL, PostgreSQL, any ODBC
compliant database (MS SQL, Oracle, etc), or a pipe in a custom XML format.
Adding new data source drivers (eg. to natively support other DBMSes) is
designed to be as easy as possible. RT indexes, as of 1.10-beta, can only
be populated using SphinxQL.
As for the name, Sphinx is an acronym which is officially decoded as SQL
Phrase Index. Yes, I know about CMU's Sphinx project.
1.2. Sphinx features
====================
Key Sphinx features are:
* high indexing and searching performance;
* advanced indexing and querying tools (flexible and feature-rich text
tokenizer, querying language, several different ranking modes, etc);
* advanced result set post-processing (SELECT with expressions, WHERE,
ORDER BY, GROUP BY etc over text search results);
* proven scalability up to billions of documents, terabytes of data, and
thousands of queries per second;
* easy integration with SQL and XML data sources, and SphinxAPI,
SphinxQL, or SphinxSE search interfaces;
* easy scaling with distributed searches.
To expand a bit, Sphinx:
* has high indexing speed (up to 10-15 MB/sec per core on an internal
benchmark);
* has high search speed (up to 150-250 queries/sec per core against
1,000,000 documents, 1.2 GB of data on an internal benchmark);
* has high scalability (biggest known cluster indexes over 3,000,000,000
documents, and busiest one peaks over 50,000,000 queries/day);
* provides good relevance ranking through combination of phrase
proximity ranking and statistical (BM25) ranking;
* provides distributed searching capabilities;
* provides document excerpts (snippets) generation;
* provides searching from within application with SphinxAPI or SphinxQL
interfaces, and from within MySQL with pluggable SphinxSE storage
engine;
* supports boolean, phrase, word proximity and other types of queries;
* supports multiple full-text fields per document (up to 32 by default);
* supports multiple additional attributes per document (ie. groups,
timestamps, etc);
* supports stopwords;
* supports morphological word forms dictionaries;
* supports tokenizing exceptions;
* supports both single-byte encodings and UTF-8;
* supports stemming (stemmers for English, Russian and Czech are
built-in; and stemmers for French, Spanish, Portuguese, Italian,
Romanian, German, Dutch, Swedish, Norwegian, Danish, Finnish,
Hungarian, are available by building third party libstemmer library);
* supports MySQL natively (all types of tables, including MyISAM,
InnoDB, NDB, Archive, etc are supported);
* supports PostgreSQL natively;
* supports ODBC compliant databases (MS SQL, Oracle, etc) natively;
* ...has 50+ other features not listed here, refer to API and
configuration manual!
1.3. Where to get Sphinx
========================
Sphinx is available through its official Web site at
http://sphinxsearch.com/.
Currently, Sphinx distribution tarball includes the following software:
* indexer: a utility which creates fulltext indexes;
* search: a simple command-line (CLI) test utility which searches
through fulltext indexes;
* searchd: a daemon which enables external software (eg. Web
applications) to search through fulltext indexes;
* sphinxapi: a set of searchd client API libraries for popular Web
scripting languages (PHP, Python, Perl, Ruby).
* spelldump: a simple command-line tool to extract the items from an
ispell or MySpell (as bundled with OpenOffice) format dictionary to
help customize your index, for use with wordforms.
* indextool: a utility to dump miscellaneous debug information about
the index, added in version 0.9.9-rc2.
1.4. License
============
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version. See COPYING file for details.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place, Suite 330, Boston, MA 02111-1307 USA
Non-GPL licensing (for OEM/ISV embedded use) can also be arranged, please
contact us to discuss commercial licensing possibilities.
1.5. Credits
============
Author
------
Sphinx initial author (and a benevolent dictator ever since):
* Andrew Aksyonoff, http://shodan.ru
Team
----
Past and present employees of Sphinx Technologies Inc who should be noted
for their work on Sphinx (in alphabetical order):
* Alexander Klimenko
* Alexey Dvoichenkov
* Alexey Vinogradov
* Ilya Kuznetsov
* Stanislav Klinov
Contributors
------------
People who contributed to Sphinx and their contributions (in no particular
order):
* Robert "coredev" Bengtsson (Sweden), initial version of PostgreSQL
data source
* Len Kranendonk, Perl API
* Dmytro Shteflyuk, Ruby API
Many other people have contributed ideas, bug reports, fixes, etc. Thank
you!
1.6. History
============
Sphinx development was started back in 2001, because I didn't manage to
find an acceptable search solution (for a database driven Web site) which
would meet my requirements. Actually, each and every important aspect was
a problem:
* search quality (ie. good relevance)
* statistical ranking methods performed rather badly, especially on
large collections of small documents (forums, blogs, etc)
* search speed
* especially if searching for phrases which contain stopwords, as in
"to be or not to be"
* moderate disk and CPU requirements when indexing
* important in shared hosting environment, not to mention the
indexing speed.
Despite the amount of time passed and numerous improvements made in the
other solutions, there's still no solution which I personally would be
eager to migrate to.
Considering that and a lot of positive feedback received from Sphinx users
during the last years, the obvious decision is to continue developing Sphinx
(and, eventually, to take over the world).
Chapter 2. Installation
=======================
Table of Contents
2.1. Supported systems
2.2. Required tools
2.3. Installing Sphinx on Linux
2.4. Installing Sphinx on Windows
2.5. Known installation issues
2.6. Quick Sphinx usage tour
2.1. Supported systems
======================
Most modern UNIX systems with a C++ compiler should be able to compile and
run Sphinx without any modifications.
Currently known systems Sphinx has been successfully running on are:
* Linux 2.4.x, 2.6.x (many various distributions)
* Windows 2000, XP
* FreeBSD 4.x, 5.x, 6.x, 7.x
* NetBSD 1.6, 3.0
* Solaris 9, 11
* Mac OS X
CPU architectures known to work include X86, X86-64, SPARC64, ARM.
Chances are good that Sphinx should work on other Unix platforms as well;
please report any platforms missing from this list that worked for you!
2.2. Required tools
===================
On UNIX, you will need the following tools to build and install Sphinx:
* a working C++ compiler. GNU gcc is known to work.
* a good make program. GNU make is known to work.
On Windows, you will need Microsoft Visual C/C++ Studio .NET 2003 or 2005.
Other compilers/environments will probably work as well, but for the time
being, you will have to create a makefile (or other environment specific
project files) manually.
2.3. Installing Sphinx on Linux
===============================
1. Extract everything from the distribution tarball (haven't you
already?) and go to the sphinx subdirectory:
| $ tar xzvf sphinx-0.9.8.tar.gz
| $ cd sphinx
2. Run the configuration program:
| $ ./configure
There are a number of options you can pass to configure. The complete
listing may be obtained by using the --help switch. The most important ones are:
* --prefix, which specifies where to install Sphinx; such as
--prefix=/usr/local/sphinx (all of the examples use this prefix)
* --with-mysql, which specifies where to look for MySQL include and
library files, if auto-detection fails;
* --with-pgsql, which specifies where to look for PostgreSQL include
and library files.
3. Build the binaries:
| $ make
4. Install the binaries in the directory of your choice: (defaults to
/usr/local/bin/ on *nix systems, but is overridden with configure
--prefix)
| $ make install
2.4. Installing Sphinx on Windows
=================================
Installing Sphinx on a Windows server is often easier than installing on
a Linux environment; unless you are preparing code patches, you can use the
pre-compiled binary files from the Downloads area on the website.
1. Extract everything from the .zip file you have downloaded -
sphinx-0.9.8-win32.zip (or sphinx-0.9.8-win32-pgsql.zip if you need
PostgreSQL support as well). You can use Windows Explorer in Windows
XP and up to extract the files, or a freeware package like 7Zip to
open the archive.
For the remainder of this guide, we will assume that the folders are
unzipped into C:\Sphinx, such that searchd.exe can be found in
C:\Sphinx\bin\searchd.exe. If you decide to use any different
location for the folders or configuration file, please change it
accordingly.
2. Edit the contents of sphinx.conf.in - specifically entries relating
to @CONFDIR@ - to paths suitable for your system.
3. Install the searchd system as a Windows service:
| C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch
4. The searchd service will now be listed in the Services panel within
the Management Console, available from Administrative Tools. It will
not have been started, as you will need to configure it and build
your indexes with indexer before starting the service. A guide to do
this can be found under Quick tour.
During the next steps of the install (which involve running indexer
pretty much as you would on Linux) you may find that you get an error
relating to libmysql.dll not being found. If you have MySQL
installed, you should find a copy of this library in your Windows
directory, or sometimes in Windows\System32, or failing that in the
MySQL core directories. If you do receive an error please copy
libmysql.dll into the bin directory.
2.5. Known installation issues
==============================
If configure fails to locate MySQL headers and/or libraries, try checking
for and installing mysql-devel package. On some systems, it is not
installed by default.
If make fails with a message which looks like
| /bin/sh: g++: command not found
| make[1]: *** [libsphinx_a-sphinx.o] Error 127
try checking for and installing gcc-c++ package.
If you are getting compile-time errors which look like
| sphinx.cpp:67: error: invalid application of `sizeof' to
| incomplete type `Private::SizeError<false>'
this means that some compile-time type size check failed. The most probable
reason is that off_t type is less than 64-bit on your system. As a quick
hack, you can edit sphinx.h and replace off_t with DWORD in a typedef for
SphOffset_t, but note that this will prohibit you from using full-text
indexes larger than 2 GB. Even if the hack helps, please report such
issues, providing the exact error message and compiler/OS details, so
I can properly fix them in upcoming releases.
If you keep getting any other error, or the suggestions above do not seem
to help you, please don't hesitate to contact me.
2.6. Quick Sphinx usage tour
============================
All the example commands below assume that you installed Sphinx in
/usr/local/sphinx, so searchd can be found in
/usr/local/sphinx/bin/searchd.
To use Sphinx, you will need to:
1. Create a configuration file.
Default configuration file name is sphinx.conf. All Sphinx programs
look for this file in current working directory by default.
Sample configuration file, sphinx.conf.dist, which has all the
options documented, is created by configure. Copy and edit that
sample file to make your own configuration: (assuming Sphinx is
installed into /usr/local/sphinx/)
| $ cd /usr/local/sphinx/etc
| $ cp sphinx.conf.dist sphinx.conf
| $ vi sphinx.conf
Sample configuration file is set up to index documents table from
MySQL database test; so there's example.sql sample data file to
populate that table with a few documents for testing purposes:
| $ mysql -u test < /usr/local/sphinx/etc/example.sql
2. Run the indexer to create full-text index from your data:
| $ cd /usr/local/sphinx/etc
| $ /usr/local/sphinx/bin/indexer --all
3. Query your newly created index!
To query the index from command line, use search utility:
| $ cd /usr/local/sphinx/etc
| $ /usr/local/sphinx/bin/search test
To query the index from your PHP scripts, you need to:
1. Run the search daemon which your script will talk to:
| $ cd /usr/local/sphinx/etc
| $ /usr/local/sphinx/bin/searchd
2. Run the attached PHP API test script (to ensure that the daemon was
successfully started and is ready to serve the queries):
| $ cd sphinx/api
| $ php test.php test
3. Include the API (it's located in api/sphinxapi.php) into your own
scripts and use it.
Happy searching!
Chapter 3. Indexing
===================
Table of Contents
3.1. Data sources
3.2. Attributes
3.3. MVA (multi-valued attributes)
3.4. Indexes
3.5. Restrictions on the source data
3.6. Charsets, case folding, and translation tables
3.7. SQL data sources (MySQL, PostgreSQL)
3.8. xmlpipe data source
3.9. xmlpipe2 data source
3.10. Live index updates
3.11. Delta index updates
3.12. Index merging
3.1. Data sources
=================
The data to be indexed can generally come from very different sources: SQL
databases, plain text files, HTML files, mailboxes, and so on. From Sphinx's
point of view, the data it indexes is a set of structured documents, each
of which has the same set of fields. This is biased towards SQL, where each
row corresponds to a document, and each column to a field.
Depending on what source Sphinx should get the data from, different code is
required to fetch the data and prepare it for indexing. This code is called
a data source driver (or simply driver or data source for brevity).
At the time of this writing, there are drivers for MySQL and PostgreSQL
databases, which can connect to the database using its native C/C++ API,
run queries and fetch the data. There's also a driver called xmlpipe, which
runs a specified command and reads the data from its stdout. See
Section 3.8, "xmlpipe data source" for the format description.
There can be as many sources per index as necessary. They will be
sequentially processed in the very same order which was specified in index
definition. All the documents coming from those sources will be merged as
if they were coming from a single source.
3.2. Attributes
===============
Attributes are additional values associated with each document that can be
used to perform additional filtering and sorting during search.
It is often desired to additionally process full-text search results based
not only on matching document ID and its rank, but on a number of other
per-document values as well. For instance, one might need to sort news
search results by date and then relevance, or search through products
within specified price range, or limit blog search to posts made by
selected users, or group results by month. To do that efficiently, Sphinx
lets you attach a number of additional attributes to each document, and
store their values in the full-text index. It's then possible to use stored
values to filter, sort, or group full-text matches.
Attributes, unlike the fields, are not full-text indexed. They are stored
in the index, but it is not possible to search them as full-text, and
attempting to do so results in an error.
For example, it is impossible to use the extended matching mode expression
@column 1 to match documents where column is 1, if column is an attribute,
and this is still true even if the numeric digits are normally indexed.
Attributes can be used for filtering, though, to restrict returned rows, as
well as sorting or result grouping; it is entirely possible to sort results
purely based on attributes, and ignore the search relevance tools.
Additionally, attributes are returned from the search daemon, while the
indexed text is not.
A good example for attributes would be a forum posts table. Assume that
only title and content fields need to be full-text searchable - but that
sometimes it is also required to limit search to a certain author or
a sub-forum (ie. search only those rows that have some specific values of
author_id or forum_id columns in the SQL table); or to sort matches by
post_date column; or to group matching posts by month of the post_date and
calculate per-group match counts.
This can be achieved by specifying all the mentioned columns (excluding
title and content, that are full-text fields) as attributes, indexing them,
and then using API calls to setup filtering, sorting, and grouping. Here is
an example.
Example sphinx.conf part:
-------------------------
| ...
| sql_query = SELECT id, title, content, \
| author_id, forum_id, post_date FROM my_forum_posts
| sql_attr_uint = author_id
| sql_attr_uint = forum_id
| sql_attr_timestamp = post_date
| ...
Example application code (in PHP):
----------------------------------
| // only search posts by author whose ID is 123
| $cl->SetFilter ( "author_id", array ( 123 ) );
|
| // only search posts in sub-forums 1, 3 and 7
| $cl->SetFilter ( "forum_id", array ( 1,3,7 ) );
|
| // sort found posts by posting date in descending order
| $cl->SetSortMode ( SPH_SORT_ATTR_DESC, "post_date" );
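What filtering, sorting, and grouping on attributes mean for a result set can also be pictured without a running daemon. The following is a plain Python sketch, not the Sphinx API: the sample rows and the helper names filter_in and month_of are hypothetical, and it assumes that both filters are applied together (a row must pass each of them). It also shows the per-month match counts mentioned in the forum example.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical full-text matches, each carrying its stored attribute values.
matches = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": 1262304000},  # 2010-01-01
    {"id": 2, "author_id": 123, "forum_id": 3, "post_date": 1264982400},  # 2010-02-01
    {"id": 3, "author_id": 456, "forum_id": 7, "post_date": 1265068800},  # 2010-02-02
    {"id": 4, "author_id": 123, "forum_id": 9, "post_date": 1265155200},  # 2010-02-03
]

def filter_in(rows, attr, allowed):
    """Keep rows whose attribute value is one of the allowed values."""
    allowed = set(allowed)
    return [row for row in rows if row[attr] in allowed]

def month_of(timestamp):
    """Map a UNIX timestamp to a (year, month) pair in UTC."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return (moment.year, moment.month)

# Filter: author 123, sub-forums 1, 3 and 7 (assumed to combine as AND).
kept = filter_in(filter_in(matches, "author_id", [123]), "forum_id", [1, 3, 7])

# Sort: by posting date, newest first.
by_date = sorted(kept, key=lambda row: row["post_date"], reverse=True)

# Group: per-month match counts over the filtered rows.
per_month = Counter(month_of(row["post_date"]) for row in kept)

print([row["id"] for row in by_date])  # [2, 1]
print(sorted(per_month.items()))       # [((2010, 1), 1), ((2010, 2), 1)]
```

Only the attribute values take part in these steps; the full-text fields (title and content) are not consulted at all.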
Attributes are named. Attribute names are case insensitive. Attributes are
not full-text indexed; they are stored in the index as is. Currently
supported attribute types are:
* unsigned integers (1-bit to 32-bit wide);
* UNIX timestamps;
* floating point values (32-bit, IEEE 754 single precision);
* string ordinals (specially computed integers);
* strings (since 1.10-beta);
* MVA, multi-value attributes (variable-length lists of 32-bit unsigned
integers).
The complete set of per-document attribute values is sometimes referred to
as docinfo. Docinfos can either be
* stored separately from the main full-text index data ("extern"
storage, in .spa file), or
* attached to each occurrence of document ID in full-text index data
("inline" storage, in .spd file).
When using extern storage, a copy of .spa file (with all the attribute
values for all the documents) is kept in RAM by searchd at all times. This
is for performance reasons; random disk I/O would be too slow. On the
contrary, inline storage does not require any additional RAM at all, but
that comes at the cost of greatly inflating the index size: remember that
it copies all attribute values every time the document ID is mentioned,
and that is exactly as many times as there are different keywords in the
document. Inline may be the only viable option if you have only a few
attributes and need to work with big datasets in limited RAM. However, in
most cases extern storage makes both indexing and searching much more
efficient.
Search-time memory requirements for extern storage are
(1+number_of_attrs)*number_of_docs*4 bytes, e.g. 10 million docs with
2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM. This is
PER DAEMON, not per query. searchd will allocate 160 MB on startup, read
the data and keep it shared between queries. The children will NOT allocate
any additional copies of this data.
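The estimate above can be sketched as a small Python helper. This is only
an illustration of the arithmetic; the function name and its parameters are
made up for this example and are not part of Sphinx.

```python
def extern_docinfo_bytes(num_attrs: int, num_docs: int) -> int:
    """Estimate RAM for extern docinfo: (1 + attrs) * docs * 4 bytes."""
    # The leading 1 and the factor of 4 come straight from the formula above.
    return (1 + num_attrs) * num_docs * 4


if __name__ == "__main__":
    # 10 million docs with 2 groups and 1 timestamp, i.e. 3 attributes.
    total = extern_docinfo_bytes(num_attrs=3, num_docs=10_000_000)
    print(f"{total / 1_000_000:.0f} MB")  # prints "160 MB"
```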
3.3. MVA (multi-valued attributes)
==================================
MVAs, or multi-valued attributes, are an important special type of
per-document attributes in Sphinx. MVAs make it possible to attach lists of
values to every document. They are useful for article tags, product
categories, etc. Filtering and group-by (but not sorting) on MVA attributes
are supported.
Currently, MVA list entries are limited to unsigned 32-bit integers. The
list length is not limited; you can have an arbitrary number of values
attached to each document as long as RAM permits (the .spm file that
contains the MVA values will be precached in RAM by searchd). The source
data can be taken either from a separate query or from a document field;
see source type in sql_attr_multi. In the first case the query will have to
return pairs of document ID and MVA value; in the second one the field will
be parsed for integer values. There are no requirements at all as to
incoming data order; the values will be automatically grouped by document
ID (and internally sorted within the same ID) during indexing anyway.
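To illustrate why incoming order does not matter, here is a minimal Python
sketch of the grouping step. It is a model of the behaviour described
above, not the indexer's actual code; the function name and the sample rows
are hypothetical.

```python
from collections import defaultdict


def group_mva_pairs(pairs):
    """Group (doc_id, value) rows by document ID; sort values within each ID."""
    grouped = defaultdict(list)
    for doc_id, value in pairs:
        grouped[doc_id].append(value)
    # Order the IDs and the values within each ID, whatever the input order.
    return {doc_id: sorted(values) for doc_id, values in sorted(grouped.items())}


# Rows as a separate MVA query might return them, in arbitrary order.
rows = [(2, 7), (1, 11), (1, 5), (2, 3), (1, 7)]
print(group_mva_pairs(rows))  # {1: [5, 7, 11], 2: [3, 7]}
```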
When filtering, a document will match the filter on MVA attribute if any of
the values satisfy the filtering condition. (Therefore, documents that pass
through exclude filters will not contain any of the forbidden values.) When
grouping by MVA attribute, a document will contribute to as many groups as
there are different MVA values associated with that document. For instance,
if the collection contains exactly 1 document having a 'tag' MVA with
values 5, 7, and 11, grouping on 'tag' will produce 3 groups with '@count'
equal to 1 and '@groupby' key values of 5, 7, and 11 respectively. Also
note that grouping by MVA might lead to duplicate documents in the result
set: because each document can participate in many groups, it can be chosen
as the best one in more than one group, leading to duplicate IDs. The PHP
API historically uses an ordered hash keyed on the document ID for the
resulting rows, so you'll also need to use SetArrayResult() in order to
employ group-by on MVA with the PHP API.
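The filtering and grouping rules above can be modelled in a few lines of
Python. This is a sketch of the semantics only, under the assumption that
documents are held as a plain dict of ID to value list; none of these
function names exist in Sphinx or its APIs.

```python
from collections import Counter


def matches_include(values, allowed):
    """Include filter: the document matches if ANY of its values is allowed."""
    return any(v in allowed for v in values)


def passes_exclude(values, forbidden):
    """Exclude filter: the document passes only if NO value is forbidden."""
    return not any(v in forbidden for v in values)


def group_counts(docs):
    """Group by MVA: a document contributes once to each distinct value."""
    counts = Counter()
    for values in docs.values():
        for v in set(values):
            counts[v] += 1
    return dict(sorted(counts.items()))


docs = {1: [5, 7, 11]}             # one document with a 'tag' MVA
print(group_counts(docs))          # {5: 1, 7: 1, 11: 1}
print(matches_include(docs[1], {7, 99}))  # True
print(passes_exclude(docs[1], {7}))       # False
```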
3.4. Indexes
============
To be able to answer full-text search queries fast, Sphinx needs to build
a special data structure optimized for such queries from your text data.
This structure is called an index, and the process of building an index
from text is called indexing.
Different index types are well suited for different tasks. For example,
a disk-based tree-based index would be easy to update (i.e. to insert new
documents into an existing index), but rather slow to search. Therefore,
the Sphinx architecture allows for different index types to be implemented
easily.
The only index type which is implemented in Sphinx at the moment is
designed for maximum indexing and searching speed. This comes at the cost
of updates being really slow; theoretically, it might be slower to update
this type of index than to reindex it from scratch. However, this can very
frequently be worked around with multiple indexes; see Section 3.10,
<<Live index updates>> for details.
It is planned to implement more index types, including the type which would
be updateable in real time.
There can be as many indexes per configuration file as necessary. The
indexer utility can reindex either all of them (if the --all option is
specified) or a certain explicitly specified subset. The searchd utility
will serve all the specified indexes, and the clients can specify which
indexes to search at run time.
3.5. Restrictions on the source data
====================================
There are a few different restrictions imposed on the source data which is
going to be indexed by Sphinx, of which the single most important one is:
ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT
OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).
If this requirement is not met, different bad things can happen. For
instance, Sphinx can crash with an internal assertion while indexing; or
produce strange results when searching due to conflicting IDs. Also,
a 1000-pound gorilla might eventually come out of your display and start
throwing barrels at you. You've been warned.
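One way to avoid such surprises is to sanity-check the IDs before indexing.
The following Python sketch is one possible pre-flight check, not something
Sphinx provides; the function name, the 'bits' parameter and the problem
labels are all assumptions made for this example.

```python
def check_document_ids(ids, bits=32):
    """Return a list of (doc_id, problem) pairs; an empty list means all OK."""
    max_id = (1 << bits) - 1  # largest unsigned value for the chosen ID width
    seen = set()
    problems = []
    for doc_id in ids:
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            problems.append((doc_id, "not an integer"))
        elif doc_id <= 0:
            problems.append((doc_id, "zero or negative"))
        elif doc_id > max_id:
            problems.append((doc_id, "out of range"))
        elif doc_id in seen:
            problems.append((doc_id, "duplicate"))
        else:
            seen.add(doc_id)
    return problems


print(check_document_ids([1, 0, 2, 2]))
# [(0, 'zero or negative'), (2, 'duplicate')]
```

Pass bits=64 when checking IDs for a build that uses 64-bit document IDs.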
3.6. Charsets, case folding, and translation tables
===================================================
When building an index, Sphinx fetches documents from the specified
sources, splits the text into words, and does case folding so that "Abc",
"ABC" and "abc" are treated as the same word (or, to be pedantic, term).
To do that properly, Sphinx needs to know
* what encoding the source text is in;
* which characters are letters and which are not;
* which letters should be folded to which letters.
This should be configured on a per-index basis using charset_type and
charset_table options. charset_type specifies whether the document encoding
is single-byte (SBCS) or UTF-8. charset_table specifies the table that maps
letter characters to their case folded versions. The characters that are
not in the table are considered to be non-letters and will be treated as
word separators when indexing or searching through this index.
Note that while the default tables do not include the space character
(ASCII code 0x20, Unicode U+0020) as a letter, it's in fact perfectly legal
to do so.
This can be useful, for instance, for indexing tag clouds, so that
space-separated word sets would index as a single search query term.
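The effect of a folding table can be sketched in Python as follows. This is
a simplified model, assuming the table is a plain dict from character to
folded character; it does not parse the real charset_table syntax, and the
names used here are hypothetical.

```python
import string


def tokenize(text, table):
    """Split text into terms: mapped characters are folded, others separate."""
    words, current = [], []
    for ch in text:
        folded = table.get(ch)
        if folded is None:
            # Character is not in the table, so it acts as a word separator.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(folded)
    if current:
        words.append("".join(current))
    return words


# Toy table: digits and lowercase map to themselves, uppercase folds down.
table = {c: c for c in string.ascii_lowercase + string.digits}
table.update({c: c.lower() for c in string.ascii_uppercase})

print(tokenize("Abc, ABC and abc-123", table))
# ['abc', 'abc', 'and', 'abc', '123']

# Treating the space as a letter keeps space-separated tags together.
tag_table = dict(table)
tag_table[" "] = " "
print(tokenize("tag cloud,other tag", tag_table))
# ['tag cloud', 'other tag']
```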
Default tables currently include English and Russian characters. Please do
submit your tables for other languages!
3.7. SQL data sources (MySQL, PostgreSQL)
=========================================
With all the SQL drivers, indexing generally works as follows.
* connection to the database is established;
* pre-query (see Section 11.1.11, <<sql_query_pre>>) is executed to
perform any necessary initial setup, such as setting per-connection
encoding with MySQL;
* main query (see Section 11.1.12, <<sql_query>>) is executed and the
rows it returns are indexed;