
When I need to scrape data online, I use Python with requests and lxml, two libraries that make it easy to extract data without going crazy.

Often I come across HTML tables with data formatted like this:

<td>
    <a href='/data1'><strong>data1</strong></a>
</td>
<td>
    data2
</td>
<td>
    data<em>3</em>
</td>

In that case we’d just like to extract the list data1, data2, and data3 from the table. Because each cell uses different markup, cleaning the values up by hand would take quite a bit of elbow grease. lxml has a special method that makes all that easy: text_content. Here’s what the documentation says about it:

Returns the text content of the element, including the text content of its children, with no markup.
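To see what "including the text content of its children" means in practice, here is a small sketch that compares text_content() with the plain .text attribute on the third cell from the snippet above. The wrapping table and tr tags and the variable names are my own additions for illustration.

```python
from lxml import html

# Wrap the third cell in a minimal table so the parser sees complete markup.
table = html.fromstring("<table><tr><td>data<em>3</em></td></tr></table>")
cell = table.xpath("//td")[0]

# .text only holds the text that comes before the first child element.
print(cell.text)            # data

# text_content() joins the text of the cell and all of its descendants.
print(cell.text_content())  # data3
```

With .text alone, the "3" inside the em tag is lost, which is exactly the cleanup work text_content() saves.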

For the previous HTML snippet we’d extract the data like this:

>>> from lxml import html
>>> root = html.fromstring('''    <td>
...         <a href='...'><strong>data1</strong></a>
...     </td>
...     <td>
...         data2
...     </td>
...     <td>
...         data<em>3</em>
...     </td>
... ''')
>>> [i.text_content().strip() for i in root.xpath('//td')]
['data1', 'data2', 'data3']
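
On a real table you usually want the cells grouped by row rather than in one flat list. The following is a minimal sketch of one way to do that with the same text_content() call. The two-row sample table, the MARKUP constant, and the table_rows helper are hypothetical names made up for this example.

```python
from lxml import html

# Hypothetical two-row table reusing the cell markup from the example above.
MARKUP = """
<table>
  <tr>
    <td><a href="/data1"><strong>data1</strong></a></td>
    <td> data2 </td>
    <td>data<em>3</em></td>
  </tr>
  <tr>
    <td><strong>data4</strong></td>
    <td>data5</td>
    <td>data<em>6</em></td>
  </tr>
</table>
"""


def table_rows(markup):
    """Return one list of cleaned cell strings per <tr> in the markup."""
    root = html.fromstring(markup)
    rows = []
    for tr in root.xpath("//tr"):
        # "./td" is relative to the current row, so cells stay grouped by row.
        rows.append([td.text_content().strip() for td in tr.xpath("./td")])
    return rows


rows = table_rows(MARKUP)
print(rows)
# [['data1', 'data2', 'data3'], ['data4', 'data5', 'data6']]
```

In an actual scraper the markup argument would typically be the body of an HTTP response, for example the .text attribute of the response object returned by requests.get.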