When I need to scrape data online, I use Python with requests and lxml, two libraries that make it easy to extract data without going crazy.
Often I come across HTML tables with data formatted like this:
<td> <a href='/data1'><strong>data1</strong></a> </td>
<td> data2 </td>
<td> data<em>3</em> </td>
In that case we’d just like to extract the list data1, data2, and data3 from the table. With different markup in each cell, cleaning it up by hand would take quite a bit of elbow grease. Elements parsed with lxml.html have a method that makes all that easy: text_content. Here’s what the documentation says about it:
Returns the text content of the element, including the text content of its children, with no markup.
For the previous HTML snippet we’d extract the data like this:
>>> from lxml import html
>>> root = html.fromstring('''
... <td>
... <a href='...'><strong>data1</strong></a>
... </td>
... <td>
... data2
... </td>
... <td>
... data<em>3</em>
... </td>
... ''')
>>> [i.text_content().strip() for i in root.xpath('//td')]
['data1', 'data2', 'data3']
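To see why text_content saves the elbow grease, here is a small sketch that compares it with the plain .text attribute and then groups the cells by row. The sample table, the SAMPLE constant, and the table_rows helper are all made up for illustration; a real page would need its own XPath expressions.

```python
from lxml import html

# Hypothetical sample table, invented for this illustration.
SAMPLE = """
<table>
  <tr><td><a href="/data1"><strong>data1</strong></a></td><td> data2 </td><td>data<em>3</em></td></tr>
  <tr><td><b>data4</b></td><td>data5</td><td>data<em>6</em></td></tr>
</table>
"""


def table_rows(markup):
    """Return one list of cleaned cell strings per <tr> in the markup."""
    root = html.fromstring(markup)
    return [
        # text_content() flattens any nested markup inside the cell.
        [cell.text_content().strip() for cell in row.xpath('./td')]
        for row in root.xpath('//tr')
    ]


root = html.fromstring(SAMPLE)
first_row = root.xpath('//tr')[0]

# .text only holds the text that comes before the first child element,
# so it misses anything wrapped in <a>, <strong>, <em>, and so on.
text_only = [cell.text for cell in first_row.xpath('./td')]

rows = table_rows(SAMPLE)

print(text_only)
print(rows)
```

With .text, the first cell comes back as None and the third as just 'data', because the rest of the content lives inside child elements. text_content() gives the full string for every cell, so each row comes out as a clean list.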