brool

brool \brool\ (n.) : a low roar; a deep murmur or humming

Parsing Tables With Beautiful Soup

September 27th, 2008

Just a quick snippet, since it is obvious after writing it but was not obvious while searching for it:

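from bs4 import BeautifulSoup  # Beautiful Soup 4; the 2008 original used version 3

with open("whatever.html") as html:
    soup = BeautifulSoup(html, "html.parser")
t = soup.find(id=label)  # label holds the id attribute of the table you want
# one list per row, each cell rendered back to an HTML string
dat = [[str(td) for td in row.find_all("td")] for row in t.find_all("tr")]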

... or map a different function if you need to further parse the individual table cells. At any rate, with Beautiful Soup, many things become trivial; it really is an amazing library.
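As an example of "a different function", here is a minimal sketch that pulls the stripped text out of each cell instead of its HTML. It assumes Beautiful Soup 4; the sample markup, the table id "prices", and the helper name table_text are all made up for illustration.

```python
from bs4 import BeautifulSoup  # assumes Beautiful Soup 4 is installed

# Hypothetical sample document; the table id "prices" is invented for this sketch.
SAMPLE = """
<table id="prices">
  <tr><td> apple </td><td>1.20</td></tr>
  <tr><td> pear </td><td>0.85</td></tr>
</table>
"""

def table_text(html, table_id):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    return [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in table.find_all("tr")
    ]

rows = table_text(SAMPLE, "prices")
```

Swapping get_text for a number parser or a link extractor follows the same shape: one function applied to each td.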

Poison Reverts in Git

August 22nd, 2008

Alice owns the main branch that a bunch of people are using:

A -- B -- C -- D

Bob checks it out, makes changes 0..2, and does regular pulls:

A -- B -- 0 -- 1 -- C -- 2 -- D

Now, Alice pulls Bob's stuff: she has

A -- B -- C -- D -- (0 + 1 + 2)

Alice pulls from another developer:

A -- B -- C -- D -- (0 + 1 + 2) -- E

Bob's patch is bad! How did it get through the audits and QA and unit tests? No matter, revert it. Alice now has:

A -- B -- C -- D -- (0+1+2) -- E -- (-0 -1 -2)

(Note that Alice can't rebase Bob's changes out of the history because other developers are pulling from her).

Bob sees that his commits didn't work, is properly chastened, and makes a fix titled "3".

Now, see the problem:

  • If Bob pulls from Alice, he'll either get a merge conflict if he's made changes, or his stuff will get deleted out of his repo (!)
  • If Alice pulls from Bob, then she'll have problems -- her mainline thinks that it's already taken Bob's changes, but now he's trying to change a deleted file.

So it looks like Git is in a situation where one or the other (or both) is going to have to do a painful merge conflict resolution. There must be a better way of reverting a patch?
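To make the problem concrete, here is a toy model in Python. It is not Git's real merge algorithm: it assumes a commit is just an id plus a set of added and removed changes, and that a merge only brings over commits whose ids the receiving history has not seen. All function names are hypothetical, and the order of Bob's commits is simplified.

```python
# Toy model (NOT how Git is implemented): a commit is (id, added, removed),
# and a merge appends only the commits the receiving history has never seen.

def commit(cid, added=(), removed=()):
    return (cid, frozenset(added), frozenset(removed))

def content(history):
    """Replay the history in order and return the changes that survive."""
    live = set()
    for _cid, added, removed in history:
        live |= added
        live -= removed
    return live

def merge(ours, theirs):
    """Append the commits from `theirs` whose ids are not already in `ours`."""
    seen = {cid for cid, _, _ in ours}
    return ours + [c for c in theirs if c[0] not in seen]

base = [commit(x, added=[x]) for x in "ABCD"]
bobs = [commit(x, added=[x]) for x in "012"]
revert = commit("revert-012", removed="012")   # Alice's (-0 -1 -2)
fix = commit("3", added=["3"])                 # Bob's follow-up fix

alice = base + bobs + [commit("E", added=["E"]), revert]
bob = base + bobs + [fix]

alice_after = content(merge(alice, bob))  # Alice pulls Bob's fix
bob_after = content(merge(bob, alice))    # Bob pulls Alice's mainline
```

In this model both pulls end with the changes A, B, C, D, E and 3: Alice receives the fix without the work it fixes, and Bob loses 0, 1 and 2 from his own tree.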

Ocaml Sockets

June 23rd, 2008

There seems to be one standard library in Ocaml for dealing with HTTP, and that's Ocamlnet. Ocamlnet suffers from a few problems, the chief of which is that it's difficult to set up unless you use something like GODI to install your packages. (Sadly, this is one thing that Ocaml is still not very strong at; it's not "batteries included" like Python.)

Sometimes you don't want the whole Ocamlnet baggage; you just want the smallest, simplest routine possible to get the contents of a web page. It boils down to just a few lines of Ocaml: create a socket, connect it to the endpoint, and read the results. First we define a function to split a URL into a hostname and everything else:

open Unix
open Str

(* split a URL into (hostname, path); the path defaults to "/" when the URL has none *)
let spliturl url =
    let re = Str.regexp "\\(http://\\)?\\([^/]+\\)\\(/.*\\)?" in
        if Str.string_match re url 0 then
            let host = Str.matched_group 2 url in
            let index = try
                Str.matched_group 3 url
            with Not_found ->
                "/" in
                (host, index)
        else
            raise Not_found
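For comparison, here is a rough Python sketch of the same split. The function name split_url and the choice of ValueError are assumptions; the pattern is the one above, rewritten in Python's re syntax. Like the Ocaml version, it only knows about the http:// prefix.

```python
import re

# Same pattern as the Ocaml version, written in Python's regular-expression syntax.
URL_RE = re.compile(r"(http://)?([^/]+)(/.*)?")

def split_url(url):
    """Return (hostname, path); the path defaults to "/" when the URL has none."""
    m = URL_RE.match(url)
    if m is None:
        raise ValueError("cannot split URL: %r" % url)
    # group(3) is None when there is nothing after the hostname
    return m.group(2), m.group(3) or "/"

with_path = split_url("http://example.com/a/b?x=1")
bare_host = split_url("example.com")
```

Here with_path is ("example.com", "/a/b?x=1") and bare_host is ("example.com", "/").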

... and then some routines to read and write to a socket...

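(* read everything pending in the socket; stops at end-of-stream or on any error *)
let readall socket =
    (* recv needs a mutable buffer: Bytes, since strings are immutable by default from OCaml 4.06 *)
    let buffer = Bytes.create 512 in
    let rec _readall accum =
        try
            let count = (recv socket buffer 0 512 []) in
                if count = 0 then accum else _readall ((Bytes.sub_string buffer 0 count)::accum)
        with _ ->
            accum
    in
        String.concat "" (List.rev (_readall []))

(* write a string to a socket; returns the number of bytes actually sent, which may be fewer than requested *)
let writeall socket s =
    send_substring socket s 0 (String.length s) []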
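The read-until-empty loop is easier to see in isolation. Below is a small Python sketch of the same idea, run over a local socket pair so it needs no network. The names read_all and write_all are mine, and unlike the Ocaml version this one lets errors propagate rather than swallowing them.

```python
import socket

def read_all(sock, chunk_size=512):
    """Collect data until the peer closes its end (recv then returns b"")."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)

def write_all(sock, payload):
    """sendall keeps sending until the whole payload has been written."""
    sock.sendall(payload)

# Demonstrate on a connected pair of local sockets.
left, right = socket.socketpair()
write_all(left, b"x" * 2000)
left.close()               # closing signals end-of-stream to the reader
received = read_all(right)
right.close()
```

The 2000 bytes arrive as several 512-byte chunks, which is why the loop accumulates pieces and joins them at the end.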

Once you have those bits, the routine that gets the contents of a web page is straightforward.

(* get the contents of an arbitrary URL page *)
let gethttp url =
    let (hostname, rest) = spliturl url in
    let socket = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
    let hostinfo = Unix.gethostbyname hostname in
    let server_address = hostinfo.Unix.h_addr_list.(0) in
    let _ = Unix.connect socket (Unix.ADDR_INET (server_address, 80)) in
    let ss = "GET " ^ rest ^ " HTTP/1.0\r\nHost: " ^ hostname ^ "\r\n\r\n" in
        writeall socket ss;
        let rv = readall socket in
            Unix.close socket;
            rv

Note that this doesn't stress the error checking much; in fact, it pretty much ignores it. Use netclient in Ocamlnet if you want something robust; this is just something quick.
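For reference, the same fetch can be sketched in Python with only the standard socket module. The function names, the default port, and the timeout are assumptions for illustration; the request string is the one built in gethttp above.

```python
import socket

def build_request(hostname, path):
    """Build the minimal HTTP/1.0 GET request used in the post."""
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, hostname)).encode("ascii")

def get_http(hostname, path="/", port=80, timeout=10.0):
    """Fetch the raw response (status line, headers and body) as bytes."""
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        sock.sendall(build_request(hostname, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an HTTP/1.0 server normally closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)

request = build_request("example.com", "/index.html")
```

Calling get_http("example.com") would return the headers and body together; splitting them apart at the first blank line is left out, as in the original.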

Oh, you want a quick-and-easy server? That's just slightly more complicated; we need to create the socket, bind it to a port, and then accept any connections that happen and deal with them. Try this:

(* create a server on a given port, and invoke the given function whenever anybody makes a request *)
let httplistener port fn =
    let socket = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0 in
    let hostinfo = Unix.gethostbyname "localhost" in
    let server_address = hostinfo.Unix.h_addr_list.(0) in
        ignore (Unix.bind socket (Unix.ADDR_INET (server_address, port)));
        Unix.listen socket 10;
        while true do
            let (fd, _) = Unix.accept socket in
            let _ = set_nonblock fd in
            let ins = readall fd in
                ignore (writeall fd (fn ins));
                Unix.close fd
        done

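This binds to localhost; if you want to bind it to the world at large, use gethostname () instead of "localhost" in the hostinfo assignment (or bind to Unix.inet_addr_any to listen on every interface). Note the complete lack of error checking. An exception gets thrown? You'll leak the socket. Multithreaded or multiprocessing? Nope! And because the accepted socket is set non-blocking, readall returns only what has already arrived, so a slow client can produce an empty or partial request. Nonetheless, sometimes you just want some quick scaffolding.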
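Here is roughly the same scaffolding in Python, as a sketch rather than anything production-worthy. The function names and the single blocking recv are my own choices, not a translation of the code above; the one improvement is a try/finally so the connection is closed even when the handler raises.

```python
import socket

def make_listener(port, host="127.0.0.1"):
    """Create a TCP socket bound to host:port and start listening."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(10)
    return listener

def serve_once(listener, fn, bufsize=4096):
    """Accept one connection, hand one read's worth of request to fn, reply, close."""
    conn, _addr = listener.accept()
    try:
        request = conn.recv(bufsize)  # a single read; larger requests are truncated
        conn.sendall(fn(request))
    finally:
        conn.close()  # runs even if fn raises, so the connection is not leaked

def serve_forever(port, fn):
    """Single-threaded loop: handle one request at a time until interrupted."""
    listener = make_listener(port)
    try:
        while True:
            serve_once(listener, fn)
    finally:
        listener.close()
```

Something like serve_forever(8080, lambda request: b"HTTP/1.0 200 OK\r\n\r\nhello") is enough to poke at with a browser; the port and the response are placeholders.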

(Thanks go to this excellent socket tutorial for Python, from which I cribbed everything and translated to Ocaml)