Brool brool (n.) : a low roar; a deep murmur or humming

Beaujiful Soup


Horrible name, isn’t it?

Beautiful Soup is a really nice Python library for extracting content from possibly-sloppy HTML, and I wanted a reasonably close Clojure equivalent. Unfortunately, the standard clojure.xml parser doesn’t cope with malformed HTML; for example:

    => (require '(clojure [xml :as xml]))
    => (xml/parse "http://www.google.com")
    org.xml.sax.SAXParseException: The markup in the document preceding the root element must be well-formed. (NO_SOURCE_FILE:0)
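
The same failure can be reproduced without touching the network. This is a minimal sketch that assumes only clojure.xml from core; `parse-xml-string` and `try-parse` are hypothetical helper names made up for illustration, not part of any library.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream)
        '(org.xml.sax SAXParseException))

;; Hypothetical helper: clojure.xml/parse treats a String as a URI,
;; so wrap the markup in an InputStream to parse it as content.
(defn parse-xml-string [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

;; Hypothetical helper: returns the parsed tree, or :malformed when
;; the strict SAX parser rejects the input.
(defn try-parse [s]
  (try
    (parse-xml-string s)
    (catch SAXParseException _ :malformed)))

;; Every tag is closed, so the strict parser accepts it.
(def well-formed (try-parse "<ul><li>One</li><li>Two</li></ul>"))

;; Typical sloppy HTML: the list items and the list are never closed.
(def sloppy (try-parse "<ul><li>One<li>Two"))
```

Here `well-formed` is a nested map of `:tag`, `:attrs` and `:content`, while `sloppy` is just `:malformed`.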

Fortunately, there is already a TagSoup library that can parse non-perfect HTML, and it is very easy to integrate TagSoup into xml/parse. This module hardly does anything; it simply adds a few helper routines and brings the most-used calls into one amazingly bad namespace name.
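
One common way to do that integration is to hand xml/parse a custom "startparse" function. The following is a sketch under assumptions, not necessarily how this module does it: it assumes the TagSoup jar is on the classpath, and `startparse-tagsoup`, `parse-html-url` and `parse-html-string` are hypothetical names.

```clojure
(require '[clojure.xml :as xml])
(import '(org.xml.sax InputSource)
        '(java.io StringReader))

;; Assumption: the TagSoup jar (package org.ccil.cowan.tagsoup) is on
;; the classpath. clojure.xml/parse accepts an optional second argument,
;; a function of the source and a SAX content handler, which it calls
;; to run the actual parse.
(defn startparse-tagsoup [source content-handler]
  (doto (org.ccil.cowan.tagsoup.Parser.)
    (.setContentHandler content-handler)
    (.parse source)))

;; Hypothetical helper: parse the page at a URL given as a String.
(defn parse-html-url [url]
  (xml/parse url startparse-tagsoup))

;; Hypothetical helper: parse markup held in a String. The XMLReader
;; interface only accepts a system id or an InputSource, so wrap it.
(defn parse-html-string [s]
  (xml/parse (InputSource. (StringReader. s)) startparse-tagsoup))
```

The tree that comes back has the same `:tag` / `:attrs` / `:content` shape as a normal xml/parse result, so everything that works on clojure.xml output keeps working.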

Examples

Building your soup:

    (use 'beaujiful-soup.core)
    
    ; build soup from URL
    (def t (build-soup "http://www.google.com"))

    ; build soup from (deliberately malformed) string
    (def t2 (build-string-soup "<ul><li>One<li>Two"))

Extracting information is done with the xml-> call. Oftentimes the last thing you do will be a node or text or (attr :attribute) call, in order to convert the results into a more workable type:

    ; you can "walk" down the tree with successive tag names.  For
    ; example, get every list item inside the unordered list
    ; immediately inside the body.
    (xml-> t2 :body :ul :li node)
    ; => ({:tag :li, :attrs nil, :content ["One"]} {:tag :li, :attrs nil, :content ["Two"]})

    ; get the text for the list items
    (xml-> t2 :body :ul :li text)
    ; => ("One" "Two")

    ; Get textareas immediately inside the body.
    (xml-> t :body :textarea node)
    ; => ({:tag :textarea, :attrs {:id "csi", :style "display:none"}, :content nil})

    ; use descendants to iterate through all nodes, not just the immediate children.
    ; Get the text from all <a> tags anywhere in the document.
    (xml-> t descendants :a text)
    ; => ("Images" "Videos" "Maps" ...)

    ; Get the href attribute from all <a> tags
    (xml-> t descendants :a (attr :href))
    ; => ("http://www.google.com/imghp?hl=en&tab=wi" ... )
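
To make the walk less magical, here is roughly what a tag-by-tag descent and the final `node` / `text` / `attr` step amount to, sketched with plain maps instead of zippers. The tree is hand-built sample data, `children-tagged` is a hypothetical helper, and the real `text` also handles nested elements and whitespace, so treat this as an approximation only.

```clojure
;; Hand-built sample tree in the {:tag :attrs :content} shape that
;; clojure.xml produces. The :class attribute is made up.
(def tree
  {:tag :body, :attrs nil,
   :content [{:tag :ul, :attrs nil,
              :content [{:tag :li, :attrs {:class "first"}, :content ["One"]}
                        {:tag :li, :attrs nil, :content ["Two"]}]}]})

;; Hypothetical helper: the direct children of each node in `nodes`
;; whose tag is `tag`. Text children (strings) are skipped.
(defn children-tagged [nodes tag]
  (for [n nodes
        c (:content n)
        :when (and (map? c) (= tag (:tag c)))]
    c))

;; Roughly the :ul :li part of the walk, starting from the body node.
(def items (-> [tree] (children-tagged :ul) (children-tagged :li)))

;; Roughly what a trailing `text` produces for these simple nodes.
(def item-texts (map #(apply str (filter string? (:content %))) items))

;; Roughly what a trailing (attr :class) produces: nodes without the
;; attribute contribute nothing.
(def item-classes (keep #(get-in % [:attrs :class]) items))
```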

Use the (attr=) predicate to match an attribute value:

    ; find invisible stuff
    (xml-> t descendants (attr= :style "display:none") tag)
    ; => (:textarea :iframe)

Strings match the text inside nodes:

    ; find the link for the <a> that has "Videos" for content
    (xml-> t descendants :a "Videos" (attr :href))
    ; => ("http://video.google.com/?hl=en&tab=wv")

Arbitrary predicates can be used as well. Each predicate takes a loc (location), which it usually converts to a node before inspecting it:

    ; find any :p or :div
    (defn p-or-div [loc] (contains? #{:p :div} (:tag (node loc))))
    (xml-> t descendants p-or-div tag)
    ; => (:div :div :div :div :div :div :div :div :div :div :div :p :div :div)

    ; find the link for the <a> that has case-insensitive "Videos" for content
    (require 'clojure.string)
    (defn videos-link? [loc]
      (let [n (node loc)
            c (first (:content n))]
        ; the first child may be nil or a nested element, so check string? first
        (and (= (:tag n) :a)
             (string? c)
             (= (clojure.string/upper-case c) "VIDEOS"))))
    (xml-> t descendants videos-link? (attr :href))
    ; => ("http://video.google.com/?hl=en&tab=wv")
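
If the loc / node distinction is unfamiliar, it comes from clojure.zip. A small sketch using only core, with a hand-built tree standing in for a parsed page; `link?` is a hypothetical predicate in the same style as p-or-div.

```clojure
(require '[clojure.zip :as zip])

;; Hand-built sample tree in the shape clojure.xml produces.
(def tree
  {:tag :div, :attrs {:id "fll"},
   :content [{:tag :a, :attrs {:href "/services/"},
              :content ["Business Solutions"]}]})

;; xml-zip wraps the tree in a zipper; the result is a loc pointing
;; at the root element.
(def root-loc (zip/xml-zip tree))

;; Navigation returns a new loc. zip/down moves to the first child.
(def a-loc (zip/down root-loc))

;; Hypothetical predicate: takes a loc, converts it to a node with
;; zip/node, then inspects the node's :tag.
(defn link? [loc]
  (= :a (:tag (zip/node loc))))
```

A loc remembers where it sits in the tree, so `(zip/up a-loc)` gets back to the div, whereas the node alone is just the map for that element.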

Fundamentally, the xml-> call returns a list of locations, and you can apply arbitrary transforms as necessary. For example, let’s say that you want to build a map of href => text for all of the links:

    (defn loc-to-pair [loc]
        [ (attr loc :href), (text loc) ])
    (apply hash-map (xml-> t descendants :a loc-to-pair))
    ; => {"/services/" "Business Solutions",  ... }
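
The result above implies that the pairs arrive spliced into one flat, alternating sequence, which is exactly the shape `apply hash-map` expects. A core-only sketch of that last step; the second entry in `flat` is made-up sample data.

```clojure
;; Flat, alternating href / text values. The first pair is taken from
;; the output above; the second is invented for illustration.
(def flat ["/services/" "Business Solutions"
           "/example/" "Example"])

;; hash-map takes alternating keys and values as arguments.
(def by-href (apply hash-map flat))

;; Same map, built from explicit two-element vectors.
(def by-href-pairs (into {} (map vec (partition 2 flat))))

;; Swapping each pair keys the map by link text instead.
(def by-text
  (into {} (map (fn [[href text]] [text href]) (partition 2 flat))))
```

Keying by href keeps links that share the same text; keying by text is handier for lookups by label, but later duplicates overwrite earlier ones.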

Having a vector in the chain applies all the predicates within the vector, and filters out anything that doesn’t match. It acts a little like a lookahead in a regex. For example:

    ; Find the IDs of all divs that contain an href immediately within them
    (xml-> t descendants :div [ :a ] (attr :id))
    ; => ("fll")

    ; Find the IDs of all divs that contain an href anywhere within them
    (xml-> t descendants :div [ descendants :a ] (attr :id))
    ; => ("ghead" "gbar" "guser" "fll")
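
The difference between the two filters can be sketched with plain maps and tree-seq. This only approximates what the vector form does, the tree is made-up sample data, and all helper names are hypothetical.

```clojure
;; Made-up sample tree: one div with a nested link, one with a direct
;; link, and one with no link at all.
(def tree
  {:tag :body, :attrs nil,
   :content [{:tag :div, :attrs {:id "outer"},
              :content [{:tag :p, :attrs nil,
                         :content [{:tag :a, :attrs {:href "/x"}, :content ["x"]}]}]}
             {:tag :div, :attrs {:id "direct"},
              :content [{:tag :a, :attrs {:href "/y"}, :content ["y"]}]}
             {:tag :div, :attrs {:id "empty"}, :content nil}]})

;; Every element at or below `node`, depth-first; strings are dropped.
(defn all-nodes [node]
  (filter map? (tree-seq map? :content node)))

;; Like [ :a ]: true when a direct child has the given tag.
(defn child-tagged? [node tag]
  (boolean (some #(and (map? %) (= tag (:tag %))) (:content node))))

;; Like [ descendants :a ]: true when any element below has the tag.
(defn descendant-tagged? [node tag]
  (boolean (some #(= tag (:tag %)) (rest (all-nodes node)))))

;; The filter only decides which divs survive; the value extracted at
;; the end still comes from the div itself, not from the matched link.
(defn div-ids [root keep?]
  (for [n (all-nodes root)
        :when (and (= :div (:tag n)) (keep? n))]
    (get-in n [:attrs :id])))

(def direct-only (div-ids tree #(child-tagged? % :a)))
(def anywhere    (div-ids tree #(descendant-tagged? % :a)))
```

With this sample tree, `direct-only` is `("direct")` and `anywhere` is `("outer" "direct")`.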

Source

It’s all on GitHub.

Discussion

Comments are moderated whenever I remember that I have a blog.

mch | 2010-11-13 17:12:51
Have you looked at enlive? https://github.com/cgrand/enlive It also uses tagsoup to parse html.