When I need to scrape data from the web, I use Python with requests and lxml, two libraries that make it easy to extract data without going crazy.
I often come across HTML tables with data formatted like this:
<td>
<a href='/data1'><strong>data1</strong></a>
</td>
<td>
data2
</td>
<td>
data<em>3</em>
</td>
In that case we’d just like to extract the list data1, data2, and data3 from the table. Because each cell uses different markup, cleaning it up by hand would take quite a bit of elbow grease. The elements returned by lxml.html have a method that makes all that easy: text_content. Here’s what the documentation says about it:
Returns the text content of the element, including the text content of its children, with no markup.
For the previous HTML snippet we’d extract the data like this:
>>> from lxml import html
>>> root = html.fromstring(''' <td>
... <a href='...'><strong>data1</strong></a>
... </td>
... <td>
... data2
... </td>
... <td>
... data<em>3</em>
... </td>
... ''')
>>> [i.text_content().strip() for i in root.xpath('//td')]
['data1', 'data2', 'data3']
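To see why text_content earns its keep, it helps to compare it with the plain .text attribute, which only returns the text that comes before an element's first child. The sketch below is a minimal illustration rather than part of the original example: the SAMPLE markup (the same three cells wrapped in a full table) and the cell_text helper are names I made up for the demonstration.

```python
from lxml import html

# Hypothetical sample: the three cells from above, wrapped in a full table.
SAMPLE = """
<table>
  <tr>
    <td><a href='/data1'><strong>data1</strong></a></td>
    <td>
      data2
    </td>
    <td>data<em>3</em></td>
  </tr>
</table>
"""


def cell_text(cell):
    """Hypothetical helper: all text in the cell, whitespace collapsed."""
    return " ".join(cell.text_content().split())


root = html.fromstring(SAMPLE)
cells = root.xpath('//td')

# .text stops at the first child element, so nested markup hides the data.
raw_text = [cell.text for cell in cells]

# text_content() walks the whole subtree and drops the tags.
clean_text = [cell_text(cell) for cell in cells]

print(raw_text)
print(clean_text)  # ['data1', 'data2', 'data3']
```

Here raw_text holds None for the first cell and only 'data' for the third, while clean_text holds all three values. Collapsing whitespace with split and join instead of strip is my own choice, so that line breaks inside a cell also turn into single spaces.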