Solr

thrillerzw

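I. Introduction

Solr is built on Lucene.

Solr's main features include powerful full-text search, hit highlighting, dynamic clustering, database interfaces, and handling of rich documents (Word, PDF, etc.). Solr is also highly scalable and supports distributed search and index replication.

Solr wiki: http://wiki.apache.org/solr/FrontPage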

II. Quick start with Jetty

cd D:\sorl\solr-4.6.0\example

java -jar start.jar

http://127.0.0.1:8983/solr

This mode runs D:\sorl\solr-4.6.0\example\webapps\solr.war. Modifying the WAR this way (for example, to add the IK analyzer) is cumbersome.

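III. Solr + Tomcat deployment

1. Copy E:\sorl\solr-4.5.1\example\solr to use as the Solr home, renaming the copy to E:\sorl\solrhome.

2. Copy E:\sorl\solr-4.5.1\dist\solr-4.5.1.war into Tomcat's webapps directory and start Tomcat. Once the WAR has been unpacked, rename the extracted directory to solr.

3. Set the Solr home to E:/sorl/solrhome.

Uncomment the env-entry element in webapps -> solr -> WEB-INF -> web.xml: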

<env-entry>

<env-entry-name>solr/home</env-entry-name>

<env-entry-value>E:/sorl/solrhome</env-entry-value>

<env-entry-type>java.lang.String</env-entry-type>

</env-entry>

Alternatively, add the file E:\sorl\apache-tomcat-6.0.29\conf\Catalina\localhost\solr.xml to Tomcat:

<?xml version="1.0" encoding="UTF-8"?>
<Context docBase="E:/sorl/apache-tomcat-6.0.29/webapps/solr" debug="5" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" value="E:/sorl/solrhome" override="true" />
</Context>

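Or configure a Context inside the Host element of server.xml.

4. On startup Tomcat fails with "Error filterStart", because the WAR does not ship with any logging configuration. Copy the jars from example\lib\ext and the log4j.properties file from example\resources (both under the Solr distribution directory) into Tomcat's lib directory.

5. Once Tomcat starts cleanly, open http://localhost:8080/solr/ to reach the admin home page.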

III. Chinese word segmentation

See my post: http://thrillerzw.iteye.com/blog/2049172

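IV. Solr + Eclipse debugging environment

1. Download solr-4.5.1-src.tgz and unpack it.

2. Add Ivy support to Ant: run ant ivy-bootstrap. After Ivy is installed, ivy-2.3.0.jar appears in C:\Users\Administrator\.ant\lib; copy it into apache-ant-1.8.4\lib.

3. In cmd, change to the project directory and run ant eclipse to generate the Eclipse project files.

4. Import the project into Eclipse.

5. Create a WebContent folder in the source tree and copy the contents of solr-4.5.1-src\solr\webapp\web into it. Then copy the solr folder from solr-4.5.1-src\solr\example into WebContent to serve as solr/home.

6. Search the Eclipse Marketplace for the Jetty plugin run-jetty-run and install it.

7. Open Run Configurations, right-click Jetty Webapp, and choose New. Set the port to 80 (and any other options), then add -Dsolr.solr.home=WebContent/solr under VM arguments on the Arguments tab.

8. Run Jetty and open http://127.0.0.1/solrsrc in a browser.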

V. Demo

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.FacetField.Count;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.RangeFacet;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class SolrDemo {
	// Index three sample documents
	public static void solrIndex(){
		try {
			String url = "http://localhost:8080/solr";
			HttpSolrServer server = new HttpSolrServer(url);
			SolrInputDocument doc = new SolrInputDocument();
			doc.addField("id", "1");
			doc.addField("name", "马航失联");
			doc.addField("age", 25);
			doc.addField("content", "家人好担心");
			doc.addField("testik", "希望马航顺利回来，家里人真的真的爱你，加油");
			
			SolrInputDocument doc2 = new SolrInputDocument();
			doc2.addField("id", "2");
			doc2.addField("name", "加油马航");
			doc2.addField("age", 30);
			doc2.addField("content", "必须加油2014");
			doc2.addField("testik", "马航加油，大家都在等你们回来");
			
			SolrInputDocument doc3 = new SolrInputDocument();
			doc3.addField("id", "3");
			doc3.addField("name", "软件很累");
			doc3.addField("age", 30);
			doc3.addField("content", "喜欢还好，一天一天");
			doc3.addField("testik", "测试ik,马航加油，thriller加油");
			server.add(doc);
			server.add(doc2);
			server.add(doc3);
			server.commit();
		} catch (SolrServerException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		
	}
	
	// Query
	public static void solrSearcher(){
		try {
			String url = "http://localhost:8080/solr";
			HttpSolrServer server = new HttpSolrServer(url);
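			/**
			 * 1. A search field set in code takes precedence over the default search field configured in schema.xml.
			 * 2. A field prefix on the search term takes precedence over the qf value in solrconfig.xml, which in turn takes precedence over the default search field in schema.xml.
			 * 3. setFields selects the returned fields in code and takes precedence over the fl value in solrconfig.xml.
			 */
			// Several clauses can be combined with AND / OR
			SolrQuery query = new SolrQuery("testik:加油"); 
			// Default search field
			//query.setParam("df", "name");
			
			// Fields to return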
			String[] fields = {"id","name","content","testik","age"};
			query.setFields(fields);
			
			// Highlighting
			query.addHighlightField("testik");
			query.setHighlight(true);
			query.setHighlightSimplePre("<em class=\"highlight\" >");
			query.setHighlightSimplePost("</em>");
			// Fragment size, in characters
			query.setHighlightFragsize(4);
			// Sorting: several clauses may be added; clauses added first take priority
			query.addSort("age", ORDER.asc);
			query.addSort("id", ORDER.desc);
		
			
			// Filter queries: only matching documents are kept, which narrows the result set; several may be added
			String[] fqs = {"testik:加油"};
//			String[] fqs = {"testik:加油","name:加油马航"};
			query.addFilterQuery(fqs);
			
			// Paging: start offset and rows per page. TODO: look into Lucene/Solr out-of-memory issues
			query.setStart(0);
			query.setRows(10);
			
			// Faceting
			// Fields to compute field facets on
			String[] ftf = {"name","age"};
			query.addFacetField(ftf);
			// Range facet: start at 1, end at 28, one bucket every 10. The last bucket extends past 28 and is still counted.
			query.addNumericRangeFacet("age", 1, 28, 10);
			
			QueryResponse response = server.query(query);
			
			 // Field facet results for "name" and "age"
			 List<FacetField> listField = response.getFacetFields();
			 for(FacetField facetField : listField){
				 System.out.println(facetField.getName());
				 List<Count> counts = facetField.getValues();
				 for(Count c : counts){
					 System.out.println(c.getName()+":"+c.getCount());
				 }
			 }
			// Range facet results for age
			 List<RangeFacet> listFacet = response.getFacetRanges();
			 for(RangeFacet rf : listFacet){
				 List<RangeFacet.Count> listCounts = rf.getCounts();
				 for(RangeFacet.Count count : listCounts){
					 System.out.println("RangeFacet:"+count.getValue()+":"+count.getCount());
				 }
			 }
			
			SolrDocumentList list = response.getResults();
			// Outer map key: the document id (doc.getFieldValue("id")); inner map key: the highlighted field name
			 Map<String,Map<String,List<String>>> map = response.getHighlighting();
			System.out.println("total hits:"+list.getNumFound());
			for(SolrDocument doc : list){
				System.out.println("id:"+doc.getFieldValue("id"));
				System.out.println("name:"+doc.getFieldValue("name"));
				System.out.println("content:"+doc.get("content"));
				System.out.println("age:"+doc.getFieldValue("age"));
				System.out.println("testik:"+doc.get("testik"));
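				// Look up this document's highlight snippets by id; either level may be missing
				Map<String,List<String>> hl = map.get(String.valueOf(doc.getFieldValue("id")));
				List<String> snippets = (hl == null) ? null : hl.get("testik");
				if (snippets != null && !snippets.isEmpty()) {
					System.out.println("hl:"+snippets.get(0));
					// Overwrite the stored value with the highlighted snippet: doc.setField(field, highlightedValue)
					doc.setField("testik", snippets.get(0));
				}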
				System.out.println("hl testik:"+doc.get("testik"));
				System.out.println();
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	
	// Delete
	public static void solrDelIndex(){
		try {
			String url = "http://localhost:8080/solr";
			HttpSolrServer server = new HttpSolrServer(url);
			server.deleteById("1");
			server.commit();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
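
The range facet call in the demo, addNumericRangeFacet("age", 1, 28, 10), can be hard to picture. The sketch below is plain Java, not SolrJ. It only reproduces the bucket arithmetic described in the comment, assuming Solr's default settings, under which every bucket keeps the full gap width (so the last one runs past the end value) and each bucket includes its lower bound but not its upper bound. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeFacetBuckets {
    /** Returns {lower, upper} for each bucket, read as [lower, upper); the last bucket may extend past end. */
    static List<int[]> buckets(int start, int end, int gap) {
        if (gap <= 0) {
            throw new IllegalArgumentException("gap must be positive");
        }
        List<int[]> out = new ArrayList<>();
        // Keep opening buckets while the lower bound is still below the requested end.
        for (int lower = start; lower < end; lower += gap) {
            out.add(new int[] {lower, lower + gap});
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] b : buckets(1, 28, 10)) {
            System.out.println("[" + b[0] + ", " + b[1] + ")");
        }
    }
}
```

For start 1, end 28 and gap 10 this yields [1, 11), [11, 21) and [21, 31). The sample ages in the demo (25, 30, 30) all fall into the last bucket.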

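5.2.

Before indexing, prepare the data: for records whose action field is delete, remove both the index entry and the database record; for records whose action field is update, remove the index entry, set action to add, and reindex.

After indexing, update the flag field and the index-time field in the database.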
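
The preparation rules above can be sketched in plain Java. This is only an illustration of the described logic: the post does not show the actual table or code, so the Row and Plan types and their field names are assumptions, and no Solr or database calls are made.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexSyncPlanner {
    /** Hypothetical row: an id plus the action column described in the text. */
    static final class Row {
        final String id;
        String action; // "add", "update" or "delete"

        Row(String id, String action) {
            this.id = id;
            this.action = action;
        }
    }

    /** Work to carry out before indexing, derived from the action column. */
    static final class Plan {
        final List<String> deleteFromIndex = new ArrayList<>();
        final List<String> deleteFromDatabase = new ArrayList<>();
        final List<Row> toIndex = new ArrayList<>();
    }

    static Plan plan(List<Row> rows) {
        Plan p = new Plan();
        for (Row r : rows) {
            switch (r.action) {
                case "delete":
                    // Remove both the index entry and the database record.
                    p.deleteFromIndex.add(r.id);
                    p.deleteFromDatabase.add(r.id);
                    break;
                case "update":
                    // Drop the stale index entry, then treat the row as a fresh add.
                    p.deleteFromIndex.add(r.id);
                    r.action = "add";
                    p.toIndex.add(r);
                    break;
                case "add":
                    p.toIndex.add(r);
                    break;
                default:
                    throw new IllegalArgumentException("unknown action: " + r.action);
            }
        }
        return p;
    }
}
```

In a real job the ids in deleteFromIndex would presumably go to server.deleteById, and the rows in toIndex would be converted to SolrInputDocument objects as in the demo, followed by the flag and index-time update once the commit succeeds.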

VI. Configuration files

schema.xml

The id field type:
<!-- multiValued="true" is for fields that hold multiple values; it is typically paired with a copy field such as <copyField source="name" dest="name_content" /> -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="pname" type="text_chinese" indexed="true"  stored="true" />

<!-- sortMissingLast="true": documents without this field sort after documents that have it, regardless of the requested sort order. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
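<!-- Dynamic fields, kept for extensibility. At index time, a field name that matches no regular field is matched against the dynamic field patterns. -->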
<dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
<dynamicField name="*_is" type="int"    indexed="true"  stored="true"  multiValued="true"/>

<!-- Chinese word segmentation with IK -->
   <fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="index">

        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" dicPath="ext.dic"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <!-- in this example, we will only use synonyms at query time

        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>

        -->
        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

      <analyzer type="query">

        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

    </fieldType>

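An analyzer chain must contain exactly one tokenizer. The tokenizer splits the input text stream (the string value of the field) into tokens. A filter takes a token stream as input and produces a token stream as output, so several filters can be composed into a filter chain. Filters process the incoming tokens, for example by stemming or removing stopwords. Solr ships with a large number of tokenizers and filters, and the same mechanism makes it easy to plug in custom ones.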
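
To make the tokenizer/filter split concrete, here is a toy chain in plain Java. It does not use the Lucene or Solr classes; the whitespace tokenizer, the two filters and all names are stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

final class ToyAnalyzer {
    /** The single tokenizer: text in, tokens out (here: split on whitespace). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** A filter maps a token list to a token list, so filters can be chained. */
    static UnaryOperator<List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Drops every token that appears in the stopword set (case-sensitive). */
    static UnaryOperator<List<String>> stop(Set<String> stopwords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stopwords.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }

    /** Runs the tokenizer once, then applies the filters in the given order. */
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<List<String>>... filters) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> f : filters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "is"));
        System.out.println(analyze("The Solr index is ready", lowerCase(), stop(stopwords)));
    }
}
```

Filter order matters in this toy version: the stop filter is case-sensitive, so lowercasing runs first. The schema above instead sets ignoreCase="true" on StopFilterFactory, which is why it can place the lowercase filter last.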

Analyzers, tokenizers, and filters that ship with Solr: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

If an analyzer has no type attribute, the same analyzer is used at both index time and query time.

VII. HttpSolrServer initialization

SysConsts.SOLR_SERVER = new HttpSolrServer(SysConsts.SOLR_URL);
SysConsts.SOLR_SERVER.setSoTimeout(SysConsts.SOLR_TIME_OUT); // socket read timeout
SysConsts.SOLR_SERVER.setConnectionTimeout(SysConsts.SOLR_TIME_OUT);
SysConsts.SOLR_SERVER.setDefaultMaxConnectionsPerHost(100);
SysConsts.SOLR_SERVER.setMaxTotalConnections(100);
SysConsts.SOLR_SERVER.setFollowRedirects(false); // defaults to
SysConsts.SOLR_SERVER.setAllowCompression(true);
SysConsts.SOLR_SERVER.setMaxRetries(2);

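VIII. Query syntax

Inclusive (closed) range bounds use [], exclusive (open) bounds use {}. TO must be uppercase, and so must boolean operators such as AND in the standard query parser.

prodlineid:13100 AND indexnumber:[1497 TO 1499}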
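
A small helper can keep the bracket rules straight when query strings are assembled in code. The sketch below is plain string building and is not part of SolrJ; the class and method names are hypothetical, and it does not escape special characters.

```java
final class RangeQuery {
    /** Builds a field:[lo TO hi] style clause; [ ] bounds are inclusive, { } bounds exclusive. */
    static String range(String field, long lo, long hi, boolean includeLo, boolean includeHi) {
        return field + ":" + (includeLo ? "[" : "{") + lo + " TO " + hi + (includeHi ? "]" : "}");
    }

    /** Joins clauses with the uppercase AND operator. */
    static String and(String... clauses) {
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(and("prodlineid:13100", range("indexnumber", 1497, 1499, true, false)));
    }
}
```

The resulting string can be passed to new SolrQuery(...) in the same way as the query in the demo.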

IX. Exceptions

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream

Fix: the Solr URL passed to the client was wrong.