○。o傻〃仔〃仔o。: Installing MySQL+Sphinx+SphinxSE (full-text search with Chinese word segmentation)

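I. Installing MySQL+Sphinx+SphinxSE:
    1. Install Python support (the commands below are for CentOS; on other Linux distributions use the equivalent package manager)
yum install -y python python-devel

    2. Compile and install LibMMSeg. LibMMSeg is a Chinese word segmentation package designed for the Sphinx full-text search engine. It is released under the GPL and implements Chih-Hao Tsai's MMSEG algorithm. In this article it is used to build the Chinese segmentation dictionary.

    The archive "sphinx-0.9.8-rc2-chinese.zip" contains mmseg-0.7.3.tar.gz, sphinx-0.9.8-rc2.tar.gz, and the Chinese word segmentation patches.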

wget http://www.coreseek.com/uploads/sources/csft3_0b2.tar.gz
wget http://www.coreseek.com/uploads/sources/mmseg3_0b2.tar.gz
unzip sphinx-0.9.8-rc2-chinese.zip
tar zxvf mmseg3_0b2.tar.gz
cd mmseg3_0b2/
./configure
make
make install
cd ../

    3. Compile and install MySQL 5.1.26-rc, Sphinx, and the SphinxSE storage engine
wget http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.1.26-rc.tar.gz/from/http://mirror.x10.com/mirror/mysql/
tar zxvf mysql-5.1.26-rc.tar.gz

tar zxvf csft3_0b2.tar.gz
cd csft3_0b2/
patch -p1 < ../sphinx-0.98rc2.zhcn-support.patch
patch -p1 < ../fix-crash-in-excerpts.patch
cp -rf mysqlse ../mysql-5.1.26-rc/storage/sphinx
cd ../

cd mysql-5.1.26-rc/
sh BUILD/autorun.sh
./configure --with-plugins=sphinx --prefix=/usr/local/mysql1/ --enable-assembler --with-extra-charsets=complex --enable-thread-safe-client --with-big-tables --with-readline --with-ssl --with-embedded-server --enable-local-infile
make && make install
cd ../
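To check that SphinxSE was built in, run "show engines;" in the mysql client once the server is running; the output should list a sphinx engine.

cd csft3_0b2/
export CPPFLAGS=-I/usr/include/python2.4
export LDFLAGS=-lpython2.4
./configure --prefix=/usr/local/sphinx --with-mysql=/usr/local/mysql1
make
make install
cd ../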

cp /usr/local/sphinx/etc/sphinx.conf.dist /usr/local/sphinx/etc/sphinx.conf

    4. Create the Sphinx index files and the MySQL data directory

/usr/local/sphinx/bin/indexer test1 --config /usr/local/sphinx/etc/sphinx.conf

/usr/local/mysql1/bin/mysql_install_db --datadir=/usr/local/mysql1/var

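    5. Create the MySQL configuration files
        (1) Create the configuration file /mysql/3306/my.cnf

        cd mysql-5.1.26-rc/
        cp support-files/my-medium.cnf /mysql/3306/my.cnf
        vim /mysql/3306/my.cnf
        server_id=2    (must differ from the master and from the 3406 instance)
        port=3306

        (2) Create the configuration file /mysql/3406/my.cnf
        cd mysql-5.1.26-rc/
        cp support-files/my-medium.cnf /mysql/3406/my.cnf
        vim /mysql/3406/my.cnf
        server_id=3    (must differ from the master and from the 3306 instance)
        port=3406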
    6. Set up a MySQL slave for the search engine to use
    7. Create a script for starting, stopping, restarting, and killing the MySQL process
        cp support-files/mysql.server /etc/rc.d/init.d/mysql
        vim /etc/rc.d/init.d/mysql
        conf=/mysql/3306/my.cnf
        $bindir/mysqld_safe --defaults-file=/mysql/3306/my.cnf --datadir=$datadir --pid-file=$server_pid_file $other_args >/dev/null 2>&1 &
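II. Sphinx configuration

1. Generate the Chinese segmentation dictionary for Sphinx
    (1) Building the dictionary

mmseg -u unigram.txt

This command produces a file named unigram.txt.uni. Rename that file to uni.lib to complete the dictionary. Note that unigram.txt must be UTF-8 encoded.

    (2) Dictionary file format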
....
河 187
x:187
造假者 1
x:1
台北隊 1
x:1
湖邊 1
......

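Each record takes two lines. The first line is the term, in the format [word]\t[frequency]. For a single character, the number is the frequency with which that character occurs as a word on its own. This frequency has to be measured over a large pre-segmented corpus, and users normally do not need to change it when adding or removing words. For any multi-character word, the frequency must be 1. The second line is a placeholder: the LibMMSeg code was adapted from another Coreseek segmentation library (an N-gram model), where the second line held the word's frequency distribution across parts of speech. LibMMSeg users simply put "x:1" on the second line.

Users can add their own custom words by editing the dictionary file, which improves segmentation accuracy in a specific domain. The default dictionary file is data/unigram.txt.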
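
As a rough sketch of the record format described above, the shell functions below print and append the two-line entry for a multi-character word. The names dict_entry and add_word, and the file path in the usage line, are made up for this illustration; they are not part of LibMMSeg.

```shell
#!/bin/sh
# Hypothetical helpers, not part of LibMMSeg.

# Print one dictionary record for a multi-character word:
#   line 1: <word><TAB>1   (the frequency is fixed at 1 for multi-character words)
#   line 2: x:1            (placeholder line)
dict_entry() {
    printf '%s\t1\nx:1\n' "$1"
}

# Append the record for word $1 to the dictionary source file $2.
add_word() {
    dict_entry "$1" >> "$2"
}
```

A call such as `add_word 兒童動畫片 data/unigram.txt` would append the two lines; uni.lib then has to be regenerated with mmseg -u before the new word takes effect.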

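    (3) Chinese dictionary for the Sphinx+MySQL search engine

2. Create the directories for the Sphinx main index and incremental (delta) index files
mkdir /usr/local/sphinx/var/data/test1/
mkdir /usr/local/sphinx/var/data/test1stemmed/
3. Create the Sphinx configuration file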
#in MySQL
CREATE TABLE sphcounter
(
   counterid INTEGER PRIMARY KEY NOT NULL,
   max_doc_id INTEGER NOT NULL
);
# This table records the document ID up to which the main index was last rebuilt
# in sphinx.conf
source src1
{
    type                    = mysql
    sql_host                = localhost
    sql_user                = root
    sql_pass                = 123
    sql_db                    = test
    sql_port                = 3306    # optional, default is 3306
    sql_sock                = /usr/local/mysql1/var/mysql.sock # everything above is the database connection
    sql_query_pre            = SET NAMES utf8
    sql_query_pre            = replace into sphcounter \
        select 1,MAX(postid) from pa_gposts # update the marker before building the main index
    sql_query                = \
        SELECT postid, title,group_id \
                FROM pa_gposts where postid <= \
        (select max_doc_id from sphcounter where counterid=1) # the main index covers IDs up to the marker
    sql_attr_uint        = group_id # not full-text indexed, but results can be sorted by this attribute
    sql_ranged_throttle    = 0 # delay before each ranged query, in ms; 0 means no delay
    #sql_query_info        = SELECT * FROM pa_gposts WHERE postid=$id
}
source src1throttled : src1
{
    sql_query_pre = set names utf8
    sql_query = SELECT postid, title, group_id \
                FROM pa_gposts where postid > \
                (select max_doc_id from sphcounter where counterid=1) # the delta index covers IDs above the marker
}
index test1
{
    source            = src1 # data source
    path            = /usr/local/sphinx/var/data/test1/test1 # index location; the directory /usr/local/sphinx/var/data/test1/ must exist
    docinfo            = extern
    mlock            = 0
    min_word_len        = 1
    charset_type        = zh_cn.utf-8 # must be zh_cn.utf-8 for Chinese indexing
    charset_dictpath = /root/mmseg-0.7.3/data/ # dictionary directory; must contain uni.lib, the dictionary generated by mmseg
    min_prefix_len    = 0
    min_infix_len        = 1
    ngram_len                = 1
    ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
    html_strip                = 0 # do not strip HTML tags
# Settings such as min_word_len, charset_type, ngram_chars and ngram_len are the ones needed for Chinese search.

}
index test1stemmed : test1
{
    source                  =src1throttled
    path            = /usr/local/sphinx/var/data/test1stemmed/test1stemmed
}
indexer
{
    mem_limit            = 256M
}
searchd
{
    port                = 3312
    log                    = /usr/local/sphinx/var/log/searchd.log
    query_log            = /usr/local/sphinx/var/log/query.log
    read_timeout        = 5
    max_children        = 30
    pid_file            = /usr/local/sphinx/var/log/searchd.pid
    max_matches            = 1000
    seamless_rotate        = 1
    preopen_indexes        = 0
    unlink_old            = 1
}
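
The main/delta split defined by src1 and src1throttled can be pictured with a small shell sketch. It only mimics the two WHERE clauses on a plain list of IDs; the function name split_main_delta is made up, and nothing here talks to MySQL or Sphinx.

```shell
#!/bin/sh
# Illustration only: mimic the WHERE clauses of src1 (postid <= max_doc_id)
# and src1throttled (postid > max_doc_id) on a plain list of IDs.
# $1 is the marker stored in sphcounter.max_doc_id; the rest are postid values.
split_main_delta() {
    max_doc_id=$1
    shift
    main=""
    delta=""
    for id in "$@"; do
        if [ "$id" -le "$max_doc_id" ]; then
            main="$main $id"
        else
            delta="$delta $id"
        fi
    done
    printf 'main:%s\n' "$main"
    printf 'delta:%s\n' "$delta"
}

split_main_delta 3 1 2 3 4 5
```

With a marker of 3, IDs 1 to 3 land in the main index and IDs 4 and 5 in the delta. Rebuilding the main index runs the "replace into sphcounter" pre-query first, which moves the marker to the current MAX(postid), so the delta starts out empty again.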

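4. Build all indexes defined in the Sphinx configuration
/usr/local/sphinx/bin/indexer --all --config /usr/local/sphinx/etc/sphinx.conf
5. Create two shell scripts, one to build the main index and one to build the delta index (this step is optional)

    1. Main index script build_main_index.sh
            #!/bin/sh
            /usr/local/sphinx/bin/searchd --stop>>searchdlog
            /usr/local/sphinx/bin/indexer test1 --config /usr/local/sphinx/etc/sphinx.conf>>mainindexlog
            /usr/local/sphinx/bin/searchd>>searchdlog
    Make it executable
        chmod u+x build_main_index.sh
    Run the script on a schedule
        crontab -e
        Add a line consisting of the schedule fields followed by the script's absolute path, e.g. /root/build_delta_index.sh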
    2. Delta index script build_delta_index.sh
        #!/bin/sh
        /usr/local/sphinx/bin/searchd --stop >> searchdlog
        /usr/local/sphinx/bin/indexer test1stemmed --config /usr/local/sphinx/etc/sphinx.conf >> deltaindexlog
        /usr/local/sphinx/bin/indexer --merge test1 test1stemmed --config /usr/local/sphinx/etc/sphinx.conf >> deltaindexlog
        /usr/local/sphinx/bin/searchd >> searchdlog
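
A crontab line needs five schedule fields in front of the command. The sketch below prints two such lines; the schedules (daily at 03:00 for the main index, minute 30 of every hour for the delta) are placeholders chosen for illustration, and the /root/ location of both scripts is an assumption.

```shell
#!/bin/sh
# Print example crontab lines. The schedules are placeholders, not taken
# from the original setup. Each line has five schedule fields
# (minute hour day-of-month month day-of-week) followed by the absolute
# path of the script to run.
print_cron_entries() {
    cat <<'EOF'
0 3 * * * /root/build_main_index.sh
30 * * * * /root/build_delta_index.sh
EOF
}

print_cron_entries
```

The printed lines can be pasted into the editor opened by crontab -e. Both scripts stop and restart searchd, so the schedules should leave enough time for one run to finish before the next starts. The scripts also write their logs to relative paths, which under cron resolve against the user's home directory.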

6. Start the Sphinx daemon
    /usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf
To stop it: /usr/local/sphinx/bin/searchd --config /usr/local/sphinx/etc/sphinx.conf --stop
7. Configure the commands to run automatically at server boot
8. Create the Sphinx storage engine table
CREATE TABLE `sphinx` (
`id` int(11) NOT NULL,
`weight` int(11) NOT NULL,
`query` varchar(255) NOT NULL,
`group_id` int(11) NOT NULL,
KEY `Query` (`Query`)
) ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1';
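Unlike an ordinary MySQL table, this one specifies ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/test1'. This means the table uses the SphinxSE engine and the connection string to Sphinx is 'sphinx://localhost:3312/test1', where test1 is the index name.
According to the official Sphinx documentation, the table must have at least three columns. Their names do not matter, but their types must be, in order, integer, integer, varchar, representing the document ID, the match weight, and the query. The query column must be indexed. Further columns may be added, but only of type integer or TIMESTAMP. They are bound to the attributes in the Sphinx result set, so their names must match the attribute names defined in sphinx.conf; otherwise they come back as NULL.

For example, since sql_attr_uint = group_id is defined above, a group_id column can be added to this table.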
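
To make the column rules concrete, here is a minimal sketch that prints a SphinxSE table definition for a given index and list of integer attributes. The helper name sphinxse_table_ddl is hypothetical, and the host and port are hard-coded to the localhost:3312 used in this article.

```shell
#!/bin/sh
# Hypothetical helper: print a SphinxSE table definition.
# $1 = table name, $2 = Sphinx index name, remaining arguments = integer
# attribute names, which must match the attribute names in sphinx.conf.
sphinxse_table_ddl() {
    table=$1
    index=$2
    shift 2
    # The first three columns are fixed: integer, integer, varchar.
    printf 'CREATE TABLE `%s` (\n' "$table"
    printf '  `id` INT NOT NULL,\n'
    printf '  `weight` INT NOT NULL,\n'
    printf '  `query` VARCHAR(255) NOT NULL,\n'
    # One extra integer column per Sphinx attribute.
    for attr in "$@"; do
        printf '  `%s` INT NOT NULL,\n' "$attr"
    done
    printf '  KEY `query` (`query`)\n'
    printf ") ENGINE=SPHINX CONNECTION='sphinx://localhost:3312/%s';\n" "$index"
}

sphinxse_table_ddl sphinx test1 group_id
```

The call at the end prints a statement equivalent to the CREATE TABLE shown above; passing more attribute names adds one column for each.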

III. Calling the search engine from SQL
    1. A simple query
        select * from sphinx where query='動畫';
    2. A join query:
        select docs.title from test.pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='制作;limit=1000';

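query='keyword': the keyword is the term to search for; for example, query='CGArt' runs a full-text search for CGArt.

mode: the matching mode, one of all, any, phrase, boolean, extended; the default is all
    all, match all query words (the default mode)
    any, match any of the query words
    phrase, treat the whole query as a phrase that must match completely and in order
    boolean, treat the query as a boolean expression (see section 4.2, "Boolean query syntax", of the Sphinx manual)
    extended, treat the query as an expression in Sphinx's internal query language (see section 4.3, "Extended query syntax", of the Sphinx manual)

sort: the sorting mode, which must be one of relevance, attr_desc, attr_asc, time_segments, extended. In every mode except relevance,
    the attribute name (or, for extended, the sort clause) follows the mode name after a colon.
    ... where query='test;sort=attr_asc:group_id';    sorts by group_id in ascending order
    ... where query='test;sort=extended:@weight desc,group_id asc';
    relevance mode: sort by relevance, descending (best matches first)
    attr_desc mode: sort by an attribute, descending (larger values first)
    attr_asc mode: sort by an attribute, ascending (smaller values first)
    time_segments mode: sort by time segment (last hour/day/week/month) descending, then by relevance descending
    extended mode: sort by an SQL-like combination of columns, each ascending or descending
        RELEVANCE ignores any additional parameters and always sorts by relevance rank. All other modes require an additional sort clause whose syntax depends on the mode.
        ATTR_ASC, ATTR_DESC and TIME_SEGMENTS each require only an attribute name.
        RELEVANCE mode is equivalent to sorting by "@weight DESC, @id ASC" in extended mode,
        ATTR_ASC mode is equivalent to "attribute ASC, @weight DESC, @id ASC", and
        ATTR_DESC is equivalent to "attribute DESC, @weight DESC, @id ASC".
        In TIME_SEGMENTS mode, attribute values are split into "time segments", and results are sorted first by time segment and then by relevance.
        In EXTENDED mode, you can specify an SQL-like sort expression involving at most 5 attributes (including internal attributes), for example:
            @relevance DESC, group_id ASC, @id DESC
        Known internal attributes:
            @id (match ID)
            @weight (match weight)
            @rank (match weight)
            @relevance (match weight)
        @rank and @relevance are just additional aliases for @weight.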
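
The equivalences between the simple sort modes and extended mode listed above can be written down as a small lookup. This is only a sketch of that mapping; the function name sort_to_extended is invented for the example and is not a Sphinx command.

```shell
#!/bin/sh
# Hypothetical helper: print the extended-mode sort clause that is
# equivalent to a simple sort mode, per the equivalences listed above.
# $1 = sort mode, $2 = attribute name (unused for relevance).
sort_to_extended() {
    case $1 in
        relevance) printf '%s\n' '@weight DESC, @id ASC' ;;
        attr_asc)  printf '%s\n' "$2 ASC, @weight DESC, @id ASC" ;;
        attr_desc) printf '%s\n' "$2 DESC, @weight DESC, @id ASC" ;;
        *)
            # time_segments has no equivalent clause given in the text above
            echo "no extended equivalent listed for: $1" >&2
            return 1
            ;;
    esac
}

sort_to_extended attr_asc group_id
```

The call at the end prints "group_id ASC, @weight DESC, @id ASC", so sort=attr_asc:group_id and sort=extended:group_id ASC, @weight DESC, @id ASC should order results the same way.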

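offset: the starting position in the result set; the default is 0

limit: the number of matches to retrieve from the result set; the default is 20

index: the name(s) of the index(es) to search
... where query='test;index=test1';
... where query='test;index=test1,test2,test3;';

minid, maxid: the minimum and maximum document ID to match
weights: a comma-separated list of weights assigned to the Sphinx full-text fields
   ... where query='test;weights=1,2,3;';
filter, !filter: a comma-separated attribute name followed by the set of values to match
   # only include groups 1, 5 and 19
   ... where query='test;filter=group_id,1,5,19;';
   # exclude groups 3 and 11
   ... where query='test;!filter=group_id,3,11';
range, !range: a comma-separated attribute name followed by the minimum and maximum values to match
   # include groups from 3 to 7
   ... where query='test;range=group_id,3,7;';
   # exclude groups from 5 to 25
   ... where query='test;!range=group_id,5,25;';
maxmatches: the maximum number of matches per query
   ... where query='test;maxmatches=2000;';
groupby: the group-by function and attribute
   ... where query='test;groupby=day:published_ts;';
   ... where query='test;groupby=attr:group_id;';
groupsort: the group-by sort clause
   ... where query='test;groupsort=@count desc;';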
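
Since every option is appended to the search phrase with a semicolon, the query string can be assembled mechanically. The following is a minimal sketch; the function name build_sphinxse_query is made up, and it does no escaping of quotes or semicolons inside the phrase.

```shell
#!/bin/sh
# Hypothetical helper: join a search phrase ($1) and any number of
# key=value options (remaining arguments) with ';', the separator used
# in the SphinxSE query column. No escaping is performed.
build_sphinxse_query() {
    query=$1
    shift
    for opt in "$@"; do
        query="$query;$opt"
    done
    printf '%s\n' "$query"
}

build_sphinxse_query 'test' 'mode=any' 'filter=group_id,1,5,19' 'limit=100'
```

The call at the end prints test;mode=any;filter=group_id,1,5,19;limit=100, which can be placed between the quotes of where query='...'.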

select count(*) from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='動畫;limit=1000';

Search for posts whose title contains 動畫
select count(*) from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='@title動畫;limit=100000;mode=extended';

IV. Adding a dictionary word and its effect
    1. Add the word 兒童動畫片 to the dictionary
        select docs.title from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='兒童動畫片;limit=100000';
+--------------------------------------------------------------------------------------------------+
| title                                                                                            |
+--------------------------------------------------------------------------------------------------+
| 兒童動畫片兒童影視/動畫連續劇迅雷下載集                                     |
| 發精彩兒童動畫片10部，下載從速                                                     |
| 【兒童節專題】【17部經典動畫片下載,附名單】                                  |
| <span style="color:red">[圖]</span>兒童安全教育動畫片《平安》                      |
| 十五部國產兒童動畫片下載                                                             |
| 推薦不用注冊就能下載數千首兒童歌曲、動畫片、遊戲、故事等育兒資源 |
| 求兒童動畫片                                                                               |
| 兒童歌曲、兒童故事、兒童動畫片下載                                              |
| 兒童動畫片--童話合集23部                                                               |
+--------------------------------------------------------------------------------------------------+
9 rows in set (0.00 sec)
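        Before the word is added, the query is split into 兒童/動畫片

    vim unigram.txt    and add the following two lines, with a tab between the word and the 1 (see II.1.(2), dictionary file format)
兒童動畫片 1
x:1
    (Appendix) Checking the segmentation
    mmseg -d <dict_dir> tobe_segment.txt
    The -d switch gives the location of the dictionary: dict_dir is the directory containing the dictionary file (uni.lib), and tobe_segment.txt is the text file to segment, which must be UTF-8 encoded. If everything is correct, mmseg prints the segmentation result and the time taken to standard output.
    mmseg -d mmseg-0.7.3/data a
    論壇/x 裡/x 有/x 沒有/x 迪/x 斯/x 尼/x 的/x 小公/x 主/x 動畫片/x ，/x 睡/x 美人/x ，/x 阿/x 拉丁/x ，/x 灰姑娘/x
    2. Generate the dictionary
mmseg -u unigram.txt
mv unigram.txt.uni uni.lib
    3. Restart the server and rebuild the index
/etc/rc.d/init.d/mysql restart    (MySQL has to be restarted because of its cache)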

bin/searchd --stop

bin/indexer test1

bin/searchd
    4. Check the result
mysql> select docs.title from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='兒童動畫片;limit=100000';
+--------------------------------------------------------------+
| title                                                        |
+--------------------------------------------------------------+
| 發精彩兒童動畫片10部，下載從速                 |
| 十五部國產兒童動畫片下載                         |
| 兒童動畫片兒童影視/動畫連續劇迅雷下載集 |
| 求兒童動畫片                                           |
| 兒童歌曲、兒童故事、兒童動畫片下載          |
| 兒童動畫片--童話合集23部                           |
+--------------------------------------------------------------+
6 rows in set (0.06 sec)

        After the word is added, only titles containing 兒童動畫片 as a whole are returned

V. Delta index test

    1. Original data
    mysql> select docs.title from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='詞典;limit=100000';
+-----------------------------------------------------------------------------+
| title                                                                       |
+-----------------------------------------------------------------------------+
| 孕婦小詞典                                                             |
| 征婚魔鬼詞典                                                          |
| 和大家分享一個很棒的在線學習詞典，對小孩很有幫助的 |
| [轉貼]女人流行詞典                                                  |
| 你不得不看的魔鬼詞典                                              |
+-----------------------------------------------------------------------------+
5 rows in set (0.13 sec)
    insert into pa_gposts (title) values('詞典的構造');
    bin/searchd --stop
    2. Build the delta index
    bin/indexer test1stemmed --config /usr/local/sphinx/etc/sphinx.conf
    3. Merge the indexes
    bin/indexer --merge test1 test1stemmed --config /usr/local/sphinx/etc/sphinx.conf
    bin/searchd
    4. Check the result
    mysql> select docs.title from pa_gposts docs join sphinx on (docs.postid=sphinx.id) where query='詞典;limit=100000';
+-----------------------------------------------------------------------------+
| title                                                                       |
+-----------------------------------------------------------------------------+
| 孕婦小詞典                                                             |
| 征婚魔鬼詞典                                                          |
| 和大家分享一個很棒的在線學習詞典，對小孩很有幫助的 |
| [轉貼]女人流行詞典                                                  |
| 你不得不看的魔鬼詞典                                              |
| 詞典的構造                                                             |
+-----------------------------------------------------------------------------+
6 rows in set (0.08 sec)

Source From google

Link from 囧

This entry was posted by ○。o傻〃仔〃仔o。 on Tuesday, February 17, 2009.
