When crawling certain pages with htmlparser (for example http://bbs.pcpop.com/O71228/1286458.html), an org.htmlparser.util.EncodingChangeException is thrown.
For example, running the following code (a JUnit test method):
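import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

// Method of a JUnit test class; "logger" is a field of that class.
public void testLinkTag() {
    try {
        // Match only <a> tags.
        NodeFilter filter = new NodeClassFilter(LinkTag.class);
        Parser parser = new Parser();
        parser.setURL("http://bbs.pcpop.com/O71228/1286458.html");
        // Re-apply the encoding the parser detected from the HTTP response.
        parser.setEncoding(parser.getEncoding());
        logger.fatal("Encoding is " + parser.getEncoding());
        NodeList list = parser.extractAllNodesThatMatch(filter);
        for (int i = 0; i < list.size(); i++) {
            LinkTag node = (LinkTag) list.elementAt(i);
            logger.fatal("testLinkTag() Link is :" + node.extractLink());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}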
The following exception is thrown:
org.htmlparser.util.EncodingChangeException: character mismatch (new: 涓 [0x6d93] != old: [0x4e2d中]) for encoding change from UTF-8 to GB2312 at character offset 158
at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
at org.htmlparser.Parser.visitAllNodesWith(Parser.java:726)
at ParserTestCase1.testImageVisitor(ParserTestCase1.java:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
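Analyzing pages of this kind shows that the main cause is the way org.htmlparser.tags.MetaTag handles the page's default encoding.
For http://bbs.pcpop.com/O71228/1286458.html, the encoding declared inside the page itself is gb2312: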
<META http-equiv="Content-Type" content="text/html; charset=gb2312">
However, the server's response header declares utf-8, so a browser decodes the page as utf-8:
HTTP/1.x 200 OK
Date: Thu, 19 Jun 2008 03:16:53 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 130386
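The character mismatch reported in the exception can be reproduced without htmlparser. The following is a minimal sketch, assuming the page bytes really are UTF-8 as the response header says; the class and method names (EncodingMismatchDemo, redecode) are made up for illustration. It encodes the character 中 (0x4e2d) as UTF-8 and then decodes those bytes as GB2312, which is what happens when the META charset is applied to the stream.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Encodes text with the charset the bytes are really in,
    // then decodes the same bytes with a different, assumed charset.
    static String redecode(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String original = "\u4e2d"; // the character reported as "old" in the exception
        String decoded = redecode(original, StandardCharsets.UTF_8,
                Charset.forName("GB2312"));

        // The first two UTF-8 bytes (E4 B8) form a valid GB2312 pair,
        // so a different character comes back.
        System.out.printf("old: 0x%04x, new: 0x%04x%n",
                (int) original.charAt(0), (int) decoded.charAt(0));
        // prints "old: 0x4e2d, new: 0x6d93"
    }
}
```

The two code points match the "new" and "old" values in the exception message, which suggests the characters already read as UTF-8 were correct and it was the switch to GB2312 that produced the mismatch.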
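In htmlparser, however, even after calling parser.setEncoding(parser.getEncoding()), MetaTag does not keep the encoding already set on the Parser when it processes the META tag; it switches to the charset declared in the tag.
The fix is to change MetaTag.doSemanticAction as follows: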
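public void doSemanticAction ()
    throws
        ParserException
{
    String httpEquiv;
    String charset;

    httpEquiv = getHttpEquiv ();
    if ("Content-Type".equalsIgnoreCase (httpEquiv))
    {
        // Original behavior: always switch to the charset declared in the META tag.
        //charset = getPage ().getCharset (getAttribute ("CONTENT"));
        //getPage ().setEncoding (charset);

        // Changed behavior: only honor the META charset while the page is still
        // using the default encoding, i.e. nothing else has set one.
        // Strings are compared by value here rather than with ==.
        if (Page.DEFAULT_CHARSET.equalsIgnoreCase (getPage ().getEncoding ()))
        {
            charset = getPage ().getCharset (getAttribute ("CONTENT"));
            getPage ().setEncoding (charset);
        }
    }
}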
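The idea behind this change is that a charset from the HTTP response takes precedence, and the META declaration is only a fallback. It can be sketched independently of htmlparser. This is a simplified illustration, not htmlparser's actual logic: the class CharsetPrecedenceSketch, its methods, and the regular expression are hypothetical, and DEFAULT_CHARSET only stands in for the role Page.DEFAULT_CHARSET plays in the library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetPrecedenceSketch {

    // Assumed fallback, standing in for the library's default charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Matches the charset parameter of a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // The HTTP header wins; the META tag is consulted only when the header
    // names no charset; the default is used when neither does.
    static String chooseEncoding(String headerContentType, String metaContent) {
        String fromHeader = charsetOf(headerContentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        String fromMeta = charsetOf(metaContent);
        return fromMeta != null ? fromMeta : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // Header and META disagree, as on the page discussed here.
        System.out.println(chooseEncoding("text/html; charset=utf-8",
                "text/html; charset=gb2312")); // prints "utf-8"
        // Header has no charset, so the META declaration is used.
        System.out.println(chooseEncoding("text/html",
                "text/html; charset=gb2312")); // prints "gb2312"
    }
}
```

One difference from the patched doSemanticAction: the sketch checks whether the header named a charset at all, while the patch checks whether the page is still on the default encoding, so a server that explicitly sends the default charset would be treated differently.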
When reposting, please credit: 出家如初,成佛有余 » htmlparser encoding problem