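Original work. Reposting is permitted, provided that the original source, the author information, and this notice are credited with a hyperlink. Otherwise legal liability will be pursued. http://enetq.blog.51cto.com/479739/697704
Lucene 3.4 download: http://lucene.apache.org/ (14 September 2011)
Introduction (quoted from the official site):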
- What Is Apache Lucene?
- The Apache Lucene™ project develops open-source search software, including:
-
- Apache Lucene Core™ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
- Apache Solr™ is our high performance enterprise search server, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, distributed search, database integration, web admin and search interfaces.
- Apache PyLucene™ is a Python port of the Lucene Core project.
- Apache Open Relevance Project™ is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
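★Example: this example searches a txt document for a keyword. If the keyword is found, it reports the match and prints the file name, path, size, and content.
★How it works: building an index means collecting content from an information source (a database, the Internet, and so on), copying it locally for processing, and storing it in the index store (a set of binary files). Searches then run against this local collection. Text is tokenized by an analyzer both when the index is built and when it is searched, and the same analyzer is used in both cases. The index store can be thought of as the index table plus the data and documents that the table points to. At search time the analyzer processes the keyword, the resulting terms are looked up in the index table, and the table leads to the data.
★Hands-on example:
Create a test file hello.txt with the following content: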
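The principle above can be sketched without Lucene at all. The following is a toy illustration, not how Lucene is implemented: `ToyInvertedIndex` and its `tokenize`, `add`, and `search` methods are hypothetical names, the tokenizer (lower-casing, one token per Han character) is an assumption made for this sketch, and requiring every query term to be present is a simplification. What it does show is the index table (term to document ids), the stored documents, and the same tokenizer being applied at indexing time and at search time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {
    // Index table: term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored data: document id -> original text
    private final List<String> stored = new ArrayList<>();

    // Toy tokenizer (an assumption of this sketch): lower-cases, emits each Han
    // character as its own token, and splits the rest on non-letter/non-digit characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            if (word.length() > 0) {          // a separator or Han character ends the current word
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Indexing: store the text and register every term in the index table.
    public int add(String text) {
        int id = stored.size();
        stored.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Searching: tokenize the query with the SAME tokenizer, then keep the
    // documents that contain every query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public String get(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("hello1 world test for fd. document document");
        index.add("hello是 测试测试搜索 1 hrllo hello hello hello");
        System.out.println(index.search("搜索"));     // [1]
        System.out.println(index.search("hello"));    // [1] ("hello1" is a different term)
        System.out.println(index.search("zazazaza")); // []
    }
}
```

If the query were tokenized differently from the indexed text, the terms would not line up with the keys of the index table, which is why the same analyzer is used on both sides.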
- hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
1. Create a Java project
2. Import the jars required for Lucene 3.4
lucene-core-3.4.0.jar // core jar
contrib\highlighter\lucene-highlighter-3.4.0.jar // highlighting
contrib\analyzers\lucene-analyzers-3.4.0.jar // analyzers (tokenizers)
Create a (local) data source folder luceneDataSource and an index folder luceneIndex
3. LuceneDemo.java source code:
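The two folders and the test file can also be created from code rather than by hand. This is only a convenience sketch: the class `DemoLayout` and its `prepare` method are hypothetical, the folders are created under whatever root directory is passed in (not the absolute path used in the demo), and writing hello.txt as UTF-8 is an assumption, since the encoding of the original file is not stated.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DemoLayout {
    // Creates luceneDataSource/ and luceneIndex/ under projectRoot,
    // writes hello.txt into the data source folder, and returns its path.
    public static Path prepare(Path projectRoot) throws IOException {
        Path dataSource = Files.createDirectories(projectRoot.resolve("luceneDataSource"));
        Files.createDirectories(projectRoot.resolve("luceneIndex"));

        String content = "hello1 world test for fd. document document\n"
                + "Just a case; hel\n"
                + "hello是 测试测试搜索 1 hrllo hello hello hello\n";
        Path hello = dataSource.resolve("hello.txt");
        // UTF-8 is an assumption of this sketch
        Files.write(hello, content.getBytes(StandardCharsets.UTF_8));
        return hello;
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary directory so the sketch does not touch a real project
        Path root = Files.createTempDirectory("luceneDemo");
        System.out.println("Test file written to " + prepare(root).toAbsolutePath());
    }
}
```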
- import java.io.File;
-
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriter.MaxFieldLength;
-
- import org.apache.lucene.queryParser.MultiFieldQueryParser;
- import org.apache.lucene.queryParser.QueryParser;
- import org.apache.lucene.search.Filter;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.store.FSDirectory;
- import org.apache.lucene.util.Version;
- import org.junit.Test;
-
- import com.yaxing.utils.File2Document;
-
- public class LuceneDemo {
- String filePath = "J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneDataSource\\hello.txt";
- File indexPath = new File("J:\\MyEclipse-8.6\\lucene\\LuceneDemo\\luceneIndex");
- Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
-
-
-
- @Test
- public void createIndex() throws Exception {
- // Convert the file into a Lucene Document
- Document doc = File2Document.file2Document(filePath);
- //Directory dir = FSDirectory.open(indexPath);
- // create = true: build a new index in indexPath, replacing any existing one
- IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexPath), analyzer, true,MaxFieldLength.LIMITED);
- indexWriter.addDocument(doc);
- indexWriter.close();
-
- }
-
-
-
- @Test
- public void search() throws Exception {
- String queryString = "搜索";
- // Parse the search text into a Query
- String[] fields = {"name","content"};
- QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_34, fields, analyzer); // query parser over both fields
- Query query = queryParser.parse(queryString);
- // Run the query
- IndexSearcher indexSearcher = new IndexSearcher(FSDirectory.open(indexPath));
- Filter filter = null;
- TopDocs topDocs = indexSearcher.search(query, filter, 10000); // topDocs holds up to 10000 top hits, much like a collection
- System.out.println("总共有【"+topDocs.totalHits+"】条匹配结果."); // prints "N matching results in total."
- // Print each hit
- for(ScoreDoc scoreDoc:topDocs.scoreDocs){
- int docSn = scoreDoc.doc; // internal document number
- Document doc = indexSearcher.doc(docSn); // fetch the stored document by its number
- File2Document.printDocumentInfo(doc); // print the document's fields
- }
- indexSearcher.close(); // release the index files
-
- }
-
-
-
-
- }
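In the search above, the third argument of `search` caps how many hits are returned, while `totalHits` counts every matching document. The relationship can be sketched in plain Java. This is an illustration of the idea only, not Lucene's collector code: `TopNSketch`, `Hit`, `Result`, and `topN` are hypothetical names, and a score of 0 is used here to mean "no match", which is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    // One returned hit: document number plus its score
    public static final class Hit {
        public final int doc;
        public final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Total number of matches plus the best n hits, best first
    public static final class Result {
        public final int totalHits;
        public final List<Hit> hits;
        Result(int totalHits, List<Hit> hits) { this.totalHits = totalHits; this.hits = hits; }
    }

    // scores[doc] is the score of document number doc; 0 means the document did not match.
    public static Result topN(float[] scores, int n) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score: the head is always the weakest hit kept so far
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        int total = 0;
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) {
                continue;                 // not a match
            }
            total++;                      // every match counts towards totalHits
            heap.add(new Hit(doc, scores[doc]));
            if (heap.size() > n) {
                heap.poll();              // drop the weakest hit once more than n are held
            }
        }
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort(byScore.reversed());    // best score first
        return new Result(total, hits);
    }

    public static void main(String[] args) {
        Result r = topN(new float[] {0f, 2.5f, 0f, 1.0f, 3.0f}, 2);
        System.out.println("total matches: " + r.totalHits); // total matches: 3
        for (Hit h : r.hits) {
            System.out.println("doc " + h.doc + " score " + h.score);
        }
    }
}
```

With a single indexed document, as in this demo, both numbers coincide, so the limit of 10000 never comes into play.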
4. File2Document.java source code:
- package com.yaxing.utils; // matches the import used in LuceneDemo.java
-
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.FileNotFoundException;
- import java.io.IOException;
- import java.io.InputStreamReader;
-
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.document.Field.Index;
- import org.apache.lucene.document.Field.Store;
-
- public class File2Document {
- // File attributes: content, name, size, path
- public static Document file2Document(String path){
- File file = new File(path);
- Document doc = new Document();
- // Store: whether the field value is stored in the index (Store.YES / Store.NO)
- // Index: whether and how the field is indexed; Index.ANALYZED tokenizes the value before indexing
- doc.add(new Field("name",file.getName(),Store.YES,Index.ANALYZED));
- doc.add(new Field("content",readFileContent(file),Store.YES,Index.ANALYZED)); // readFileContent() reads the file content
- doc.add(new Field("size",String.valueOf(file.length()),Store.YES,Index.NOT_ANALYZED)); // not tokenized; the file size (long) is converted to a String
- doc.add(new Field("path",file.getAbsolutePath(),Store.YES,Index.NOT_ANALYZED)); // not tokenized; there is no need to search by file path
- return doc;
- }
-
-
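- private static String readFileContent(File file) {
- BufferedReader reader = null;
- try {
- // InputStreamReader without a charset uses the platform default encoding
- reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
- StringBuffer content = new StringBuffer();
- for(String line=null;(line = reader.readLine())!=null;){
- content.append(line).append("\n");
- }
- return content.toString();
- } catch (FileNotFoundException e) {
- e.printStackTrace();
- } catch (IOException e) {
- e.printStackTrace();
- } finally {
- // Always close the reader so the file handle is released
- if (reader != null) {
- try { reader.close(); } catch (IOException ignored) { }
- }
- }
- return null; // the file could not be read
- }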
-
- public static void printDocumentInfo(Document doc){
- System.out.println("name -->"+doc.get("name"));
- System.out.println("content -->"+doc.get("content"));
- System.out.println("path -->"+doc.get("path"));
- System.out.println("size -->"+doc.get("size"));
-
- }
-
- }
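Because hello.txt mixes English and Chinese text, the encoding used to read it matters: `readFileContent` above relies on the platform default charset. A possible variant that names the charset explicitly is sketched below. It assumes the file is UTF-8 encoded, which the original does not state; the class name `Utf8FileReader` is hypothetical, and the try-with-resources statement needs Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8FileReader {
    // Reads the whole file as UTF-8 (an assumption), ending every line with "\n".
    public static String readFileContent(File file) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null; ) {
                content.append(line).append("\n");
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a temporary file
        File tmp = File.createTempFile("hello", ".txt");
        Files.write(tmp.toPath(), "hello是 测试测试搜索\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(readFileContent(tmp));
        tmp.delete();
    }
}
```

Letting the `IOException` propagate, instead of returning null, is a design choice of this sketch: the caller sees the failure directly rather than passing a null value on to the `Field` constructor.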
5. JUnit test results:
String queryString = "搜索";
- 总共有【1】条匹配结果.
- name -->hello.txt
- content -->hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
-
- path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
- size -->109
String queryString = "hello";
- 总共有【1】条匹配结果.
- name -->hello.txt
- content -->hello1 world test for fd. document document
- Just a case; hel
- hello是 测试测试搜索 1 hrllo hello hello hello
-
- path -->J:\MyEclipse-8.6\lucene\LuceneDemo\luceneDataSource\hello.txt
- size -->109
The index is built as follows:
[image: the generated index]
String queryString = "zazazaza";
- 总共有【0】条匹配结果.
This article comes from the blog “幽灵柯南的技术blog”. Please keep this source link: http://enetq.blog.51cto.com/479739/697704