Brool brool (n.) : a low roar; a deep murmur or humming

Parsing Tables With Beautiful Soup

Tags: beautiful soup, coding, python

Just a quick snippet, since it is obvious after writing it but was not obvious while searching for it:

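    from bs4 import BeautifulSoup  # Beautiful Soup 4 package

    # "whatever.html" and label (the id of the target table) are placeholders
    with open("whatever.html") as html:
        soup = BeautifulSoup(html, "html.parser")
    t = soup.find(id=label)
    # one list per table row, each cell kept as its raw HTML string
    dat = [list(map(str, row.findAll("td"))) for row in t.findAll("tr")]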

… or map a different function if you need to further parse the individual cell contents. At any rate, with Beautiful Soup, many things become trivial; it really is an amazing library.
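For instance, here is a rough sketch of mapping a custom function over the cells instead of str. It assumes Beautiful Soup 4 (the bs4 package); the sample markup, the "prices" id, and the parse_cell helper are all made up for illustration, so adapt them to whatever your table actually looks like.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the id "prices" and the cell values are made up.
HTML = """
<table id="prices">
  <tr><td>apple</td><td> 3 </td></tr>
  <tr><td>pear</td><td>12</td></tr>
</table>
"""

def parse_cell(td):
    """Return the cell's text, converted to int when it is all digits."""
    text = td.get_text(strip=True)
    return int(text) if text.isdigit() else text

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find(id="prices")

# Same row/cell structure as the snippet above, with parse_cell in place of str
rows = [[parse_cell(td) for td in tr.find_all("td")] for tr in table.find_all("tr")]
print(rows)
```

The only part that changes from table to table is parse_cell; the nested comprehension stays the same.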

Discussion

Comments are moderated whenever I remember that I have a blog.

tim | 2015-07-12 19:13:01
There are a few really rare cases where you might parse HTML through a regex (say, machine-generated HTML that you need to parse into a database and you'll only run it once and you just need to get the data imported, already) but otherwise... no, just don't do it. The canonical reference on this topic is this answer on StackOverflow: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
David Underhill | 2010-12-18 05:05:16
Thanks, this is pretty much what I needed. I tweaked it a little to get the contents of the columns (no td tags): rows = [[c.string for c in row.findAll("td")] for row in t.findAll("tr")]
Sankalp Agarwal | 2011-01-06 13:24:42
It works seamlessly but doesn't work if the person has inserted tags in between .. E.g. ... .... . . </b> ..... .....
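A minimal sketch of the nested-tag case, assuming Beautiful Soup 4 and a made-up row in which one cell contains a <b> element: .string gives None for a cell with more than one child node, while get_text() joins all the text beneath the cell. This is one possible workaround, not necessarily what anyone in this thread ended up using.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell wraps part of its text in <b>.
HTML = "<table><tr><td>plain</td><td>some <b>bold</b> text</td></tr></table>"

soup = BeautifulSoup(HTML, "html.parser")
cells = soup.find_all("td")

# .string is None when a tag has more than one child node
with_string = [c.string for c in cells]

# get_text() concatenates every text node beneath the tag
with_get_text = [c.get_text() for c in cells]

print(with_string)
print(with_get_text)
```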
Minder | 2009-12-08 19:12:22
Thanks! That was very helpful :D
Amos | 2015-06-27 01:08:02
I've tried using Beautiful Soup as well and didn't like the overhead or documentation. But maybe it has gotten better now. Sounds like you have an aversion to using regular expressions for this. I'm curious why? In my mind, the nice thing about them is that they are pretty much cross-language compatible.