`
单眼皮大娘
  • 浏览: 110889 次
  • 性别: Icon_minigender_2
  • 来自: 上海
社区版块
存档分类
最新评论

paoding基于词典如何分词

阅读更多
    上次介绍了Paoding的字典数据结构,这次介绍下paoding是如何对待分词文本依据词典分词的。paoding在查找字典时依据两个类: BinaryDictionary 和 HashBinaryDictionary。上次也已经介绍过这两个数据结构,这里不再重复。

    HashBinaryDictionary其实就是把大块数据词典切分成小块的词典,并用BinaryDictionary存储。在用HashBinaryDictionary的search方法查找时,其采用的是递归方法,最终还是会进入BinaryDictionary中。看代码便知道
public Hit search(CharSequence input, int begin, int count) {
		SubDictionaryWrap subDic = (SubDictionaryWrap) subs.get(keyOf(input
				.charAt(hashIndex + begin)));
		if (subDic == null) {
			return Hit.UNDEFINED;
		}
		Dictionary dic = subDic.dic;
		// 对count==hashIndex + 1的处理
		if (count == hashIndex + 1) {
			Word header = dic.get(0);
			if (header.length() == hashIndex + 1) {
				if (subDic.wordIndexOffset + 1 < this.ascWords.length) {
					return new Hit(subDic.wordIndexOffset, header,
							this.ascWords[subDic.wordIndexOffset + 1]);
				} else {
					return new Hit(subDic.wordIndexOffset, header, null);
				}
			} else {
				return new Hit(Hit.UNCLOSED_INDEX, null, header);
			}
		}
		// count > hashIndex + 1
		Hit word = dic.search(input, begin, count);
		if (word.isHit()) {
			int index = subDic.wordIndexOffset + word.getIndex();
			word.setIndex(index);
			if (word.getNext() == null && index < size()) {
				word.setNext(get(index + 1));
			}
		}
		return word;
	}


可以看出确实是递归调用Search方法(在构建该结构的时候也是采用递归构造)。
而在BinaryDictionary的search方法中,采用的是二分查找(一会解释为何使用二分查找方法)。代码如下:
public Hit search(CharSequence input, int begin, int count) {
		int left = this.start;
		int right = this.end - 1;
		int pointer = 0;
		Word word = null;
		int relation;
		//
		while (left <= right) {
			pointer = (left + right) >> 1;
			word = ascWords[pointer];
			relation = compare(input, begin, count, word);
			if (relation == 0) {
				// System.out.println(new String(input,begin, count)+"***" +
				// word);
				int nextWordIndex = pointer + 1;
				if (nextWordIndex >= ascWords.length) {
					return new Hit(pointer, word, null);
				}
				else {
					return new Hit(pointer, word, ascWords[nextWordIndex]);
				}
			}
			if (relation < 0)
				right = pointer - 1;
			else
				left = pointer + 1;
		}
		//
		if (left >= ascWords.length) {
			return Hit.UNDEFINED;
		}
		//
		boolean asPrex = true;
		Word nextWord = ascWords[left];
		if (nextWord.length() < count) {
			asPrex = false;
		}
		for (int i = begin, j = 0; asPrex && j < count; i++, j++) {
			if (input.charAt(i) != nextWord.charAt(j)) {
				asPrex = false;
			}
		}
		return asPrex ? new Hit(Hit.UNCLOSED_INDEX, null, nextWord) : Hit.UNDEFINED;
	}


这里很佩服Paoding的作者,这样的结构绝不逊色于中科院字典的数据结构。中科院的字典是人为排好序的,而这里paoding的字典没有排好序列为何还能跟中科院的查找方法类似,采用二分查找法?

主要是其在构造字典数据时调用了
Arrays.sort(array);
这个方法,sort的解释如下:
Sorts the specified array of objects into ascending order, according to the natural ordering of its elements. All elements in the array must implement the Comparable interface. Furthermore, all elements in the array must be mutually comparable (that is, e1.compareTo(e2) must not throw a ClassCastException for any elements e1 and e2 in the array).

This sort is guaranteed to be stable: equal elements will not be reordered as a result of the sort.

The sorting algorithm is a modified mergesort (in which the merge is omitted if the highest element in the low sublist is less than the lowest element in the high sublist). This algorithm offers guaranteed n*log(n) performance.


为解除疑惑,也做了相应的测试:
public class TestArraySort {
	public static void main(String[] args) {
		HashSet<String> set = new HashSet<String>();
		set.add("三心二意");
		set.add("五谷丰登");
		set.add("六六大顺");
		set.add("三个人");
		set.add("五个人");
		set.add("六个人");
		Object[] array = set.toArray();
		Arrays.sort(array);
		for (int i = 0; i < array.length; i++) {
			System.out.println(array[i]);
		}
	}
}

结果如下:

三个人
三心二意
五个人
五谷丰登
六个人
六六大顺

public class TestCharactor {
	public static void main(String[] args)  {
		int c1 = '三';
		int c2 = '五';
		
		System.out.println("The category of c1 is: " + c1);
		System.out.println("The category of c2 is: " + c2);
	}
}

结果如下:

The category of c1 is: 19977
The category of c2 is: 20116

public class TestCharactor {
	public static void main(String[] args)  {
		int c1 = '个';
		int c2 = '心';
		
		System.out.println("The category of c1 is: " + c1);
		System.out.println("The category of c2 is: " + c2);
	}
}


The category of c1 is: 20010
The category of c2 is: 24515


这个充分说明了sort确实是把词典中的词按照汉字对应的值按升序排好序列,这个为二分查找埋下伏笔。

    paoding在进行切分时采用的双重循环方式,这就可以切分尽可能多的词,外层循环相当于一个游标,逐个扫面带切分的文本,内层循环尽可能的切词,把切出的词丢到一个专门接受词的类中,这就是所谓的细粒度切分。
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics