
R XML package: readHTMLTable returns garbled Chinese text

Environment: Windows 7, Ubuntu 12, RStudio Desktop
Problem: Using RStudio Desktop installed on Windows 7, I read table data from a web page with readHTMLTable from the XML package. For example:
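library(XML)
u <- "http://tech.163.com/special/00094IGJ/top1000.html"
# Parse the page, declaring its encoding as GB2312
url <- htmlParse(u, encoding = "GB2312")
# Read every table on the page into a list of data frames
tables <- readHTMLTable(url)
raw <- tables[[6]]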
Inspecting raw shows the Chinese text as garbled characters. The output of sessionInfo() is:

R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
[1] XML_3.95-0.1

loaded via a namespace (and not attached):
[1] tools_2.15.1

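This appears to be related to the operating system. I tried changing the locale with Sys.setlocale("LC_CTYPE", "UTF-8"), but it fails with the message "OS reports request to set locale to "UTF-8" cannot be honored".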
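
Before trying to change the locale, it can help to check what the session is actually using. The snippet below is a minimal sketch that only reads the current settings with base R's Sys.getlocale() and l10n_info(); it does not modify anything, and the variable names ctype and info are chosen here for illustration.

```r
# Inspect the current character-type locale without changing it.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the locale is
# multibyte (MBCS) and whether it is a UTF-8 locale.
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
cat("Multibyte locale:", info[["MBCS"]], "\n")
```

On the two machines described here, one would expect the Windows session to report a non-UTF-8 multibyte locale (code page 936) and the Ubuntu session to report a UTF-8 locale, matching the sessionInfo() listings.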
Running the same code in RStudio on Ubuntu, however, displays the Chinese text correctly. The output of sessionInfo() there is:
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

loaded via a namespace (and not attached):
[1] tools_2.14.1

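My guess is that the way the XML package encodes text interacts with the operating system's character encoding. If anyone knows the actual cause, please help explain it.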
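
One possible workaround, offered as an untested sketch rather than a confirmed fix: it assumes the strings coming back from readHTMLTable are UTF-8 bytes that R has not marked as UTF-8, so a non-UTF-8 locale such as code page 936 misreads them. The post does not confirm this cause. The helper mark_utf8 below is hypothetical (it is not part of the XML package) and simply declares the encoding of every character or factor column using base R's Encoding().

```r
# Hypothetical helper: declare the strings in a data frame to be UTF-8.
# Assumption: the bytes really are UTF-8 and are merely unmarked.
mark_utf8 <- function(df) {
  mark <- function(s) {
    Encoding(s) <- "UTF-8"
    s
  }
  for (j in seq_along(df)) {
    col <- df[[j]]
    if (is.factor(col)) {
      levels(col) <- mark(levels(col))   # factors store text in their levels
    } else if (is.character(col)) {
      col <- mark(col)
    }
    df[[j]] <- col
  }
  names(df) <- mark(names(df))           # column headers may be Chinese too
  df
}

# Offline demonstration: the UTF-8 bytes of a two-character Chinese word,
# built from raw bytes so the string starts out with no encoding mark.
utf8_bytes <- as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87))
demo <- data.frame(x = rawToChar(utf8_bytes), stringsAsFactors = FALSE)
fixed <- mark_utf8(demo)
Encoding(fixed$x)
```

If the assumption holds, calling raw <- mark_utf8(raw) after readHTMLTable would be the corresponding step in the example above. If the text is still garbled, the bytes are probably in another encoding and converting with iconv() would be the next thing to try.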

Source: ITPUB blog, link: http://blog.itpub.net/16582684/viewspace-753963/. Please credit the source when reposting; otherwise legal action will be pursued.

Posted by: 布吉卡. Please credit the source when reposting: http://www.cxybcw.com/193303.html
