FLY的狐狸

2011-12-30

System administrator

Lucene 3.4: A Getting-Started Example


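Original work. Reposting is permitted, provided that the original source, the author information, and this notice are credited with a hyperlink; otherwise legal action may be taken. http://enetq.blog.51cto.com/479739/697704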

Lucene 3.4 download: http://lucene.apache.org/ (14 September 2011)

Introduction, quoted from the official site:

What Is Apache Lucene?
The Apache Lucene™ project develops open-source search software, including:
Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
Apache Solr™ is our high performance enterprise search server, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, web admin and search interfaces.
Apache PyLucene™ is a Python port of the Lucene Core project.
Apache Open Relevance Project™ is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.

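★ Example: this example searches a .txt document for a keyword. If the keyword is found, it reports the matches and prints the file name, path, size, and content.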

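★ How it works: To build the index, information is collected from a source (a database, the Internet, and so on), copied to the local machine, processed, and stored in the index store (a set of binary files). Searches then run against this local collection. An analyzer (tokenizer) splits text into terms both when indexing and when searching, and the same analyzer is used for both. The index store can be thought of as the index table together with the data and documents that the table points to. At search time the analyzer processes the keyword, the resulting terms are compared against the index table, and the table leads to the data.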
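The index-table idea described above can be sketched in plain Java without Lucene. This is only a toy model under stated assumptions: `ToyInvertedIndex` and its `analyze` method are hypothetical names introduced for illustration, the "analyzer" merely lowercases text and splits on non-alphanumeric characters, and nothing here reflects how Lucene actually stores its index files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index for illustration only; not Lucene's implementation. */
public class ToyInvertedIndex {
    // Index table: term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored documents, addressed by their id
    private final List<String> documents = new ArrayList<>();

    /** Hypothetical stand-in for an analyzer: lowercase, then split on non-letter/digit runs. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Stores the text, records each of its terms in the index table, and returns the document id. */
    public int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Analyzes the keyword the same way as the documents, then returns ids of documents containing every term. */
    public Set<Integer> search(String keyword) {
        Set<Integer> result = null;
        for (String term : analyze(keyword)) {
            Set<Integer> ids = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    /** Returns the stored text for a document id. */
    public String document(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("hello1 world test for fd. document document"); // id 0
        index.add("Just a case; hel");                            // id 1
        index.add("1 hrllo hello hello hello");                   // id 2
        System.out.println(index.search("HELLO"));    // prints [2]
        System.out.println(index.search("zazazaza")); // prints []
    }
}
```

Because the keyword passes through the same `analyze` step as the documents, `HELLO` finds the lowercase term `hello`. The term `hello1` is a different entry in the table, so document 0 does not match. A document that contains a term several times still appears only once in the result, since the table records which documents contain a term, not every occurrence.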

★ Walkthrough:

Create a test file hello.txt with the following content:

hello1 world test for fd. document document
Just a case; hel
hello是测试测试搜索 1 hrllo hello hello hello

1. Create a Java project.

2. Add the required Lucene 3.4 JARs:

lucene-core-3.4.0.jar // core library

contrib\highlighter\lucene-highlighter-3.4.0.jar // hit highlighting

contrib\analyzers\lucene-analyzers-3.4.0.jar // analyzers

Create a folder luceneDataSource for the (local) data source and a folder luceneIndex for the index.

3. LuceneDemo.java source:

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import com.yaxing.utils.File2Document;

public class LuceneDemo {

    String filePath = "J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneDataSource\\hello.txt";
    File indexPath = new File("J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneIndex");
    // The same analyzer is used for indexing and for searching
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);

    @Test
    public void createIndex() throws Exception {
        // Convert the file into a Lucene Document
        Document doc = File2Document.file2Document(filePath);
        // create = true builds a new index, replacing any existing one.
        // This constructor is deprecated since Lucene 3.1 in favour of IndexWriterConfig.
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexPath), analyzer, true, MaxFieldLength.LIMITED);
        try {
            indexWriter.addDocument(doc);
        } finally {
            indexWriter.close();
        }
    }

    @Test
    public void search() throws Exception {
        String queryString = "搜索";
        // Parse the search text into a Query over the "name" and "content" fields
        String[] fields = {"name", "content"};
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_34, fields, analyzer);
        Query query = queryParser.parse(queryString);
        // Run the query
        IndexSearcher indexSearcher = new IndexSearcher(FSDirectory.open(indexPath));
        try {
            Filter filter = null; // no filtering
            TopDocs topDocs = indexSearcher.search(query, filter, 10000); // at most 10000 top hits
            // Prints "N matching results in total."
            System.out.println("总共有【" + topDocs.totalHits + "】条匹配结果.");
            // Print every hit
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docSn = scoreDoc.doc; // internal document number
                Document doc = indexSearcher.doc(docSn); // load the document by its number
                File2Document.printDocumentInfo(doc); // print the document's fields
            }
        } finally {
            indexSearcher.close();
        }
    }
}

4. File2Document.java source:

package com.yaxing.utils;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

public class File2Document {

    // File attributes stored as fields: content, name, size, path
    public static Document file2Document(String path) throws IOException {
        File file = new File(path);
        Document doc = new Document();
        // Store: whether the original value is kept in the index (YES or NO)
        // Index: whether and how the field is indexed; Index.ANALYZED tokenizes it first
        doc.add(new Field("name", file.getName(), Store.YES, Index.ANALYZED));
        doc.add(new Field("content", readFileContent(file), Store.YES, Index.ANALYZED));
        // Not tokenized; the file size (a long) is converted to a String
        doc.add(new Field("size", String.valueOf(file.length()), Store.YES, Index.NOT_ANALYZED));
        // The demo never searches by path, so it is indexed as one untokenized term
        doc.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NOT_ANALYZED));
        return doc;
    }

    // Reads the whole file into a String, decoding with the platform default charset.
    // Failures propagate as IOException so that a null value never reaches new Field(...).
    private static String readFileContent(File file) throws IOException {
        StringBuilder content = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        try {
            for (String line; (line = reader.readLine()) != null; ) {
                content.append(line).append("\n");
            }
        } finally {
            reader.close();
        }
        return content.toString();
    }

    // Prints the stored fields of a document
    public static void printDocumentInfo(Document doc) {
        System.out.println("name -->" + doc.get("name"));
        System.out.println("content -->" + doc.get("content"));
        System.out.println("path -->" + doc.get("path"));
        System.out.println("size -->" + doc.get("size"));
    }
}
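The file reader above decodes hello.txt with whatever charset the JVM uses by default, which matters here because the file contains Chinese text. The sketch below shows one way to pass the charset explicitly. It is only an illustration: the class name `CharsetAwareReader` and the two-argument `readFileContent` are assumptions, not part of the original code, and the Chinese query string is written with Unicode escapes so the source compiles under any source encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper: reads a text file with an explicitly chosen charset. */
public class CharsetAwareReader {

    /** Reads the whole file, decoding its bytes with the given charset. */
    public static String readFileContent(File file, Charset charset) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            for (String line; (line = reader.readLine()) != null; ) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // The two-character query string used in search(), as Unicode escapes
        String keyword = "\u641c\u7d22";

        // Write a small UTF-8 file to a temporary location
        File file = File.createTempFile("hello", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), ("hello " + keyword).getBytes(StandardCharsets.UTF_8));

        String decodedCorrectly = readFileContent(file, StandardCharsets.UTF_8);
        String decodedWrongly = readFileContent(file, StandardCharsets.ISO_8859_1);
        System.out.println(decodedCorrectly.contains(keyword)); // prints true
        System.out.println(decodedWrongly.contains(keyword));   // prints false
    }
}
```

If the text is decoded with the wrong charset, the garbled characters are what gets indexed, and a query for the original keyword finds nothing. Which charset hello.txt actually uses is not stated in this post, so the right argument has to be chosen to match how the file was saved.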

5. JUnit test results (the first output line of each run, 总共有【N】条匹配结果, means "N matching results in total"):

String queryString = "搜索";

总共有【1】条匹配结果.
name -->hello.txt
content -->hello1 world test for fd. document document
Just a case; hel
hello是测试测试搜索 1 hrllo hello hello hello
path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
size -->109

String queryString = "hello";

总共有【1】条匹配结果.
name -->hello.txt
content -->hello1 world test for fd. document document
Just a case; hel
hello是测试测试搜索 1 hrllo hello hello hello
path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
size -->109


String queryString = "zazazaza";

总共有【0】条匹配结果.

This article comes from the blog “幽灵柯南的技术blog”. Please keep this source link when reposting: http://enetq.blog.51cto.com/479739/697704
