
Recommendations based on search terms

Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0

Approach: First, extract each person's set of search terms from the data warehouse. Then segment those search terms into words and count how often each word occurs. Finally, output a user-by-word matrix whose values are the search counts.

Steps:

1. Query the Oracle database for each employee's set of search terms

    select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;

2. Segment the terms and output the user-by-word matrix (the script is Python 2)

    # -*- coding: utf-8 -*-
    import sys
    from gensim import corpora
    import jieba

    # Python 2 only: make utf-8 the default codec so unicode words can be written out
    reload(sys)
    sys.setdefaultencoding('utf-8')

    # Each input line holds an employee_id followed by the concatenated query text
    q_content = [i.split(' ') for i in open('/u01/jerry/emp_query_conten.txt').readlines()]
    train_set = [list(jieba.cut(i[1])) for i in q_content]

    # Drop punctuation and stop words
    stop_tokens = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
    train_set2 = [[j for j in i if j not in stop_tokens] for i in train_set]

    dic = corpora.Dictionary(train_set2)
    corpus = [dic.doc2bow(text) for text in train_set2]
    # Keep only the words a user searched for more than once
    corpus2 = [[j for j in i if j[1] > 1] for i in corpus]

    # Write the word id -> word dictionary
    output = open('/u01/jerry/qw_dic', 'w')
    for key, value in dic.iteritems():
        output.write(str(key) + ' ' + value + '\n')
    output.close()

    # Write (employee_id, word id, count) rows, tab-separated because step 3 splits on '\t'
    output = open('/u01/jerry/emp_q_cnt', 'w')
    for i in range(0, len(corpus2)):
        for j in corpus2[i]:
            print q_content[i][0], j[0], j[1]
            output.write(str(q_content[i][0]) + '\t' + str(j[0]) + '\t' + str(j[1]) + '\n')
    output.close()

3. Load the output file emp_q_cnt into Spark MLlib and train the prediction model

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.mllib.recommendation.Rating

    val data = sc.textFile("/home/cloudera/emp_q_cnt")
    // Each row is user, word id, count; the count plays the role of the rating
    val ratings = data.map(_.split('\t') match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })
    val rank = 10
    val numIterations = 1000
    // The last argument (0.01) is the regularization parameter lambda
    val model = ALS.train(ratings, rank, numIterations, 0.01)

4. Look up the predicted value for a given user and word (user 10008, word 2)

    model.predict(sc.parallelize(Array((10008, 2)))).map { case Rating(user, item, rate) => ((user, item), rate) }.take(1)
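The matrix-building step can be illustrated with a minimal, standard-library-only sketch. It is not the author's script: the pre-tokenized toy data, the `build_triples` helper, and the way word ids are assigned are all assumptions made for illustration (the original relies on jieba for segmentation and on gensim's `Dictionary` for ids). What it keeps from the post is the stop-token filter and the rule that only words searched more than once are emitted.

```python
from collections import Counter

# Same stop tokens as the post's filter
STOP_TOKENS = {",", "_", "-", " ", ".", "", "不", "的"}


def build_triples(user_tokens, min_count=2):
    """Turn (user, tokens) pairs into (user, word_id, count) triples.

    Word ids are handed out in first-seen order; this is a stand-in
    for gensim's Dictionary, not a reproduction of its id assignment.
    """
    word_to_id = {}
    triples = []
    for user, tokens in user_tokens:
        counts = Counter(t for t in tokens if t not in STOP_TOKENS)
        for word, count in sorted(counts.items()):
            # Every word gets an id, even if it is filtered out below
            word_id = word_to_id.setdefault(word, len(word_to_id))
            if count >= min_count:
                triples.append((user, word_id, count))
    return word_to_id, triples


# Hypothetical, already-segmented search logs for two users
sample = [
    (10008, ["spark", "mllib", "spark", "的", "als"]),
    (10009, ["oracle", "oracle", "spark", "-", "oracle"]),
]

word_to_id, triples = build_triples(sample)
for user, word_id, count in triples:
    # Tab-separated: user, word id, count
    print("%d\t%d\t%d" % (user, word_id, count))
```

Each printed row has the three-column shape (user, word id, count) that the Spark step parses into `Rating` objects, with the search count standing in for an explicit rating.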

Source: "ITPUB Blog", link: http://blog.itpub.net/16582684/viewspace-1258879/. If you repost this article, please credit the source; otherwise legal action will be pursued.

