jiebaR分词包 (the jiebaR Chinese word-segmentation package)

Study notes on a post from the R语言中文社区 (Chinese R Community) account.

Script and data files for download:
jiebaR分词包.R
dictionary.txt
stopwords.txt

The segmentation function: worker()

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH, idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5, encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05, output = NULL, bylines = F, user_weight = "max")

Parameter  Purpose
type  Type of segmentation engine. Options are mix, mp, hmm, full, query, tag, simhash, and keywords: the mixed model, maximum-probability model, hidden Markov model, full mode, query (index) mode, part-of-speech tagging, Simhash text similarity, and keyword extraction, respectively.
dict  Path to the main dictionary; defaults to DICTPATH.
hmm  Path to the hidden Markov model data; defaults to HMMPATH. Used by the engines that fall back on the HMM.
user  Path to a user-defined dictionary.
idf  Path to the inverse document frequency (IDF) table; defaults to IDFPATH. Used by the simhash and keywords engines.
stop_word  Path to the stop-word list.
qmax  Maximum word length to query; defaults to 20. Applies to the query engine.
topn  Number of keywords to return; defaults to 5. Applies to the simhash and keywords engines.
symbol  Whether to keep symbols in the output; defaults to FALSE.
lines  Maximum number of lines to read from a file at a time; defaults to 1e+05.
output  Output file; by default the file name is derived from the system time.
bylines  Whether to return the results line by line, matching the lines of the input file.
user_weight  Weight given to words from the user dictionary; one of "min", "max", or "median".
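Several of these parameters only matter for particular engine types. As a small sketch (assuming only the default dictionaries bundled with jiebaR), a keyword-extraction worker combines type = "keywords" with topn:

```r
library(jiebaR)

# A keyword-extraction worker: topn = 3 asks for the top three keywords,
# ranked by TF-IDF against the package's bundled IDF table.
kw <- worker(type = "keywords", topn = 3)

# The <= operator applies a worker to text, like segment() does.
kw <= "在连续发生的多种原因中,属于保险责任的原因在前"
```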

The other core function is segment(). If worker() builds the worker, segment() is the boss: code is the task to be done, jiebar is the worker assigned to it, and, in case the worker does not know how the job should be done, mod tells it which segmentation engine to use. Its parameters are:

Parameter  Purpose
code  A Chinese sentence, or the path to a text file
jiebar  The segmentation engine, i.e., an object created by worker()
mod  Overrides the engine's default type; one of "mix", "hmm", "query", "full", "level", "mp"
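To make mod concrete, here is a minimal sketch (the input sentence is my own illustration, not from the original post): the same worker is used twice, once with its default "mix" engine and once overridden to the pure HMM engine for that call only.

```r
library(jiebaR)

# One worker, two engine types: mod switches the engine per call,
# without rebuilding the worker.
wk <- worker()                                  # default type = "mix"
segment("结婚的和尚未结婚的", wk)                # mixed-model engine
segment("结婚的和尚未结婚的", wk, mod = "hmm")   # hidden Markov model only
```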
> # jiebaR分词包
> ## 分词函数worker
> ## worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH, idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5, encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05, output = NULL, bylines = F, user_weight = "max")
> setwd("~/Desktop")
> library(jiebaR)
> engine <- worker()
> words <- "在连续发生的多种原因中,属于保险责任的原因在前,属于除外责任的原因在后,前后者又有因果关系,则保险人就要承担赔偿责任。公众号,R语言" # note below that 公众号 and R语言 each get split into two words
> segment(words, engine)
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众" "号" "R"
[36] "语言"
> engine <= words # equivalent to the command above
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众" "号" "R"
[36] "语言"
> # Adding user-defined words or a user dictionary
> ## Method 1: the new_user_word() function:
> engine_new_word <- worker()
> new_user_word(engine_new_word, c("公众号", "R语言"))
[1] TRUE
> segment(words, engine_new_word)
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众号" "R语言"
> ## Method 2: the user argument (first build a dictionary file as plain txt):
> engine_user <- worker(user = 'dictionary.txt')
> segment(words, engine_user)
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众号" "R语言"
> ## When using a dictionary file you can also use new_user_word():
> new_user_word(engine_new_word, scan("dictionary.txt", what = "",sep = "\n"))
Read 2 items
[1] TRUE
> segment(words, engine_new_word)
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众号" "R语言"
> ## Note:
> # the first line of the dictionary file must be left blank, otherwise the first word may be lost;
> ## Removing stop words
> engine_s <- worker(stop_word = "stopwords.txt")
> segment(words, engine_s)
[1] "连续" "发生" "多种" "原因" "属于"
[6] "保险" "责任" "原因" "属于" "除外"
[11] "责任" "原因" "后者" "因果关系" "保险人"
[16] "就要" "承担" "赔偿" "责任" "公众"
[21] "号" "R" "语言"
> engine_s <- worker(stop_word = "stopwords.txt", user = "dictionary.txt")
> segment(words, engine_s)
[1] "连续" "发生" "多种" "原因" "属于"
[6] "保险" "责任" "原因" "属于" "除外"
[11] "责任" "原因" "后者" "因果关系" "保险人"
[16] "就要" "承担" "赔偿" "责任" "公众号"
[21] "R语言"
> ## Counting word frequencies
> freq(segment(words, engine_s))
char freq
1 R语言 1
2 公众号 1
3 承担 1
4 就要 1
5 连续 1
6 因果关系 1
7 保险 1
8 原因 3
9 赔偿 1
10 后者 1
11 责任 3
12 属于 2
13 发生 1
14 保险人 1
15 除外 1
16 多种 1
> ## Part-of-speech tagging (tags are attached during segmentation)
> # Method 1: appears to no longer work (the output below carries no tags)
> qseg[words]
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众" "号" "R"
[36] "语言"
> qseg <= words
[1] "在" "连续" "发生" "的" "多种"
[6] "原因" "中" "属于" "保险" "责任"
[11] "的" "原因" "在" "前" "属于"
[16] "除外" "责任" "的" "原因" "在"
[21] "后" "前" "后者" "又" "有"
[26] "因果关系" "则" "保险人" "就要" "承担"
[31] "赔偿" "责任" "公众" "号" "R"
[36] "语言"
> # Method 2:
> tagger <- worker(type = "tag")
> tagger <= words
p a v uj m
"在" "连续" "发生" "的" "多种"
n f v n n
"原因" "中" "属于" "保险" "责任"
uj n p f v
"的" "原因" "在" "前" "属于"
c n uj n p
"除外" "责任" "的" "原因" "在"
f f n d v
"后" "前" "后者" "又" "有"
n d n d v
"因果关系" "则" "保险人" "就要" "承担"
v n n m x
"赔偿" "责任" "公众" "号" "R"
n
"语言"
  • Chinese part-of-speech tags:
Tag  Part of speech  Tag  Part of speech
ag  adjectival morpheme  a  adjective
ad  adverb-adjective  an  noun-adjective
b  distinguishing word  c  conjunction
dg  adverbial morpheme  d  adverb
e  interjection  f  locative word
g  morpheme  h  prefix component
i  idiom (chengyu)  j  abbreviation
k  suffix component  l  idiomatic phrase
m  numeral  ng  noun morpheme
n  noun  nr  personal name
nr1  Chinese surname  nr2  Chinese given name
nrj  Japanese personal name  nrf  transliterated personal name
ns  place name  nt  organization or institution
nz  other proper noun  nl  noun-like idiom
vf  directional verb  vx  formal (dummy) verb
p  preposition  q  measure word
r  pronoun  s  locative noun
tg  time morpheme  t  time word
u  particle  vg  verb morpheme
v  verb  vd  adverbial verb
vn  nominal verb  w  punctuation
y  modal particle  x  non-morpheme character
z  status word  o  onomatopoeia