In preparation for my PyCon talk on HTML I thought I’d do a performance comparison of several parsers and document models.

The situation is a little complex because there’s different steps in handling HTML:

  1. Parse the HTML
  2. Parse it into something (a document object)
  3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, 更多相关的内容 »
comments 讨论   addto 把此链接加入于...  recommend 与朋友分享   report 已已沉

The situation is a little complex because there’s different steps in handling HTML:

  1. Parse the HTML
  2. Parse it into something (a document object)
  3. Serialize it

Some libraries handle 1, some handle 2, some handle 1, ">submit 'Python HTML Parser Performance' to digg   submit 'Python HTML Parser Performance' to reddit   submit 'Python HTML Parser Performance' to Pligg   submit 'Python HTML Parser Performance' to yahoo   |   书签  

Python programming language, created in 1991 by Guido van Rossum, is the sixth most popular programming language and “the programming language of year 2007“. Nokia 更多相关的内容 »
comments 讨论   addto 把此链接加入于...  recommend 与朋友分享   report 已已沉

Python 自1.5版本起增加了re 模块,它提供 Perl 风格的正则表达式模式。Python 1.5之前版本则是通过 regex 模块提供 Emecs 风格的模式。Emacs 风格模式可读性稍差些,而且功能也不强,因此编写新代码时尽量不要再使用 regex 模块,当然偶尔你还是可能在老代码里发现其踪影。


就其本质而言,正则表达式(或 RE)是一种小型的、高度专业化的编程语言,(在Python中)它内嵌在Python中,并通过 re 模块实现。使用这个小型语言,你可以为想要匹配的相应字符串集指定规则;该字符串集可能包含英文语句、e-mail地址、TeX命令或任何你想搞定的东 西。然後你可以问诸如“这个字符串匹配该模式吗?”或“在这个字符串中是否有部分匹配该模式呢?”。你也可以使用 RE 以各种方式来修改或分割字符串。


正则表达式模式被编译成一系列的字节码,然後由用 C 编写的匹配引擎执行。在高级用法中,也许还要仔细留意引擎是如何执行给定 RE ,如何以特定方式编写 RE 以 更多相关的内容 »
comments 讨论   addto 把此链接加入于...  recommend 与朋友分享   report 已已沉

In a previous entry, I showed how to extend Python without directly using the Python C using SWIG. SWIG parses an interface file written by the programmer and then outputs a wrapper module. A more elegant, more pythonic solution might be a parser which allows a C module to be written directly in Python. 更多相关的内容 »
comments 讨论   addto 把此链接加入于...  recommend 与朋友分享   report 已已沉