Brool brool (n.) : a low roar; a deep murmur or humming

Beaujiful Soup

Tags: beautiful coding parsing soup html clojure

Horrible name, isn’t it?

Beautiful Soup is a really nice Python library for extracting content from possibly-sloppy HTML, and I wanted some reasonably close Clojure equivalent. Unfortunately, the standard XML classes expect well-formed input and fail on malformed HTML. For example:

    => (require '(clojure [xml :as xml]))
    => (xml/parse "http://www.google.com")
    org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed. (NO_SOURCE_FILE:0)

Fortunately, there is already a TagSoup library that can parse non-perfect HTML, and it is very easy to integrate TagSoup into xml/parse. This module hardly does anything; it simply adds a few helper routines and brings the most-used calls into one amazingly bad namespace name.
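For the curious, the integration could look roughly like the following. This is a minimal sketch, not necessarily how this module does it: it assumes the TagSoup jar (org.ccil.cowan.tagsoup) is on the classpath, and the names startparse-tagsoup and soup-from-string are made up for illustration. The only real trick is that xml/parse accepts a second argument, a function that feeds a source into a SAX content handler.

```clojure
;; Sketch only: assumes the TagSoup jar is on the classpath.
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '[org.ccil.cowan.tagsoup Parser]
        '[org.xml.sax InputSource]
        '[java.io StringReader])

(defn startparse-tagsoup
  "A startparse function for clojure.xml/parse that uses TagSoup's
  lenient parser instead of the default strict SAX parser.
  (Hypothetical name, for illustration.)"
  [source content-handler]
  (doto (Parser.)
    (.setContentHandler content-handler)
    (.parse source)))

(defn soup-from-string
  "Parses a possibly malformed HTML string and returns an XML zipper.
  (Hypothetical helper, not this module's API.)"
  [^String html]
  (zip/xml-zip
    (xml/parse (InputSource. (StringReader. html)) startparse-tagsoup)))

;; Unclosed <li> tags would make the strict parser throw.
(def z (soup-from-string "<ul><li>One<li>Two</ul>"))

;; Tag of the root element that TagSoup wrapped around the fragment.
(:tag (zip/node z))
```

Passing a URL string instead of an InputSource should work the same way, since TagSoup's parser also accepts a system ID.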

Examples

Building your soup:

    (use 'beaujiful-soup.core)

    ; build soup from URL
    (def t (build-soup "http://www.google.com"))

    ; build soup from (deliberately malformed) string
    (def t2 (build-string-soup "<ul><li>One<li>Two</ul>"))

Extracting information is done with the xml-> call. The last step in the chain is often node, text, or (attr :attribute), which converts the results into a more workable type:

    ; You can "walk" down the tree with successive tag names. For
    ; example, get every list item inside the unordered list
    ; immediately inside the body.
    (xml-> t2 :body :ul :li node)
    ; => ({:tag :li, :attrs nil, :content ["One"]} {:tag :li, :attrs nil, :content ["Two"]})

    ; Get the text for the list items.
    (xml-> t2 :body :ul :li text)
    ; => ("One" "Two")

    ; Get textareas immediately inside the body.
    (xml-> t :body :textarea node)
    ; => ({:tag :textarea, :attrs {:id "csi", :style "display:none"}, :content nil})

    ; Use descendants to iterate through all nodes, not just the immediate children.
    ; Get the text from all <a> tags anywhere in the body.
    (xml-> t descendants :a text)
    ; => ("Images" "Videos" "Maps" ...)

    ; Get the href attribute from all <a> tags.
    (xml-> t descendants :a (attr :href))
    ; => ("http://www.google.com/imghp?hl=en&tab=wi" ... )

Use the (attr=) predicate to match an attribute value:

    ; find invisible stuff
    (xml-> t descendants (attr= :style "display:none") tag)
    ; => (:textarea :iframe)

Strings match the text inside nodes:

    ; find the href of the <a> that has "Videos" for content
    (xml-> t descendants :a "Videos" (attr :href))
    ; => ("http://video.google.com/?hl=en&tab=wv")

Arbitrary predicates can be used as well. They take a loc (a zipper location), which is usually converted to a node before being examined:

    ; find any :p or :div
    (defn p-or-div [loc]
      (contains? #{:p :div} (:tag (node loc))))
    (xml-> t descendants p-or-div tag)
    ; => (:div :div :div :div :div :div :div :div :div :div :div :p :div :div)

    ; find the href of the <a> whose content is "Videos", ignoring case
    (require 'clojure.string)
    (defn f [loc]
      (let [n (node loc)
            c (first (:content n))]
        (and (= (:tag n) :a)
             (string? c) ; skip links whose first child is missing or not text
             (= (clojure.string/upper-case c) "VIDEOS"))))
    (xml-> t descendants f (attr :href))
    ; => ("http://video.google.com/?hl=en&tab=wv")
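If the loc-versus-node distinction is fuzzy, here is a small sketch that uses nothing but clojure.zip from core Clojure, so it runs without this module or TagSoup. The page data is invented, and all-locs is a hypothetical stand-in for descendants that simply walks the zipper depth first; the real thing may differ in details.

```clojure
(require '[clojure.zip :as zip])

;; Invented data, in the {:tag :attrs :content} shape that clojure.xml produces.
(def page
  {:tag :body :attrs nil
   :content [{:tag :div :attrs {:id "top"} :content ["Hello"]}
             {:tag :p :attrs nil :content ["World"]}
             {:tag :a :attrs {:href "/videos"} :content ["Videos"]}]})

(defn all-locs
  "Every location in the tree, depth first, starting at the root.
  (Hypothetical helper standing in for descendants.)"
  [root]
  (take-while (complement zip/end?)
              (iterate zip/next (zip/xml-zip root))))

(defn p-or-div?
  "Predicate on a loc: unwrap it to a node, then inspect the tag.
  Text nodes are plain strings, so (:tag node) is nil for them."
  [loc]
  (contains? #{:p :div} (:tag (zip/node loc))))

(map (comp :tag zip/node) (filter p-or-div? (all-locs page)))
;; => (:div :p)
```

The loc carries the node plus its position in the tree, which is why a predicate can also look at parents or siblings if it needs to.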

Fundamentally, the xml-> call returns a list of locations, and you can apply arbitrary transforms as necessary. For example, let’s say that you want to build a map of href => text for all of the links:

    (defn loc-to-pair [loc]
      [(attr loc :href), (text loc)])
    (apply hash-map (xml-> t descendants :a loc-to-pair))
    ; => {"/services/" "Business Solutions", ... }

Having a vector in the chain applies all the predicates within the vector as a sub-query, and filters out any location where that sub-query finds nothing. The chain then continues from the original location, so it acts a little like a lookahead in a regex. For example:

    ; Find the IDs of all divs that contain a link immediately within them
    (xml-> t descendants :div [ :a ] (attr :id))
    ; => ("fll")

    ; Find the IDs of all divs that contain a link anywhere within them
    (xml-> t descendants :div [ descendants :a ] (attr :id))
    ; => ("ghead" "gbar" "guser" "fll")
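The lookahead idea can be sketched with plain clojure.zip and filter, which may make it clearer what the vector is doing. To be clear, this is a swapped-in approximation and not how xml-> is implemented: the page data is invented, and child-locs, descendant-locs, tag? and ids-of-divs-where are hypothetical helpers. Unlike the real descendants, descendant-locs here leaves out the starting location.

```clojure
(require '[clojure.zip :as zip])

;; Invented data: one div with a direct link, one with a nested link, one with none.
(def page
  {:tag :body :attrs nil
   :content [{:tag :div :attrs {:id "direct"}
              :content [{:tag :a :attrs {:href "/x"} :content ["X"]}]}
             {:tag :div :attrs {:id "nested"}
              :content [{:tag :p :attrs nil
                         :content [{:tag :a :attrs {:href "/y"} :content ["Y"]}]}]}
             {:tag :div :attrs {:id "empty"} :content ["no links"]}]})

(defn child-locs
  "Locations of the immediate children of loc (empty for leaves)."
  [loc]
  (take-while some? (iterate zip/right (zip/down loc))))

(defn descendant-locs
  "Locations of everything below loc, depth first, excluding loc itself."
  [loc]
  (mapcat #(cons % (descendant-locs %)) (child-locs loc)))

(defn tag?
  "Returns a predicate that is true for locs whose node has tag t."
  [t]
  (fn [loc] (= t (:tag (zip/node loc)))))

(defn ids-of-divs-where
  "IDs of all divs for which has-link? holds. The test looks below the
  div, but the div itself is what stays in the result."
  [root has-link?]
  (->> (descendant-locs (zip/xml-zip root))
       (filter (tag? :div))
       (filter has-link?)
       (map #(get-in (zip/node %) [:attrs :id]))))

;; Link immediately inside the div, like [ :a ]
(ids-of-divs-where page #(some (tag? :a) (child-locs %)))
;; => ("direct")

;; Link anywhere inside the div, like [ descendants :a ]
(ids-of-divs-where page #(some (tag? :a) (descendant-locs %)))
;; => ("direct" "nested")
```

In both calls the predicate peeks ahead at children or descendants, yet the id is read from the div, which is the same "match but don't consume" behavior as a regex lookahead.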

Source

It’s all on GitHub.

Discussion

Comments are moderated whenever I remember that I have a blog.

mch | 2010-11-13 17:12:51
Have you looked at enlive? https://github.com/cgrand/enlive It also uses tagsoup to parse html.