Export/parse many html tables into a single spreadsheet?

Hi everyone,

I have hundreds of html files, all containing the same content table—i.e., the same table_id, the same individual variables in the same order, but with different information in each file.

This is not from a webpage, it's an export from another program. I don't have the program, only the exported html.

I would very much like to get them all into a single spreadsheet in whatever format— and have no knowledge of JS or RegEx.

I just ate up an hour poking around on the forum, and elsewhere, trying to figure out how to do this almost certainly simple task— but there's just no way I am going to brute-force it and don't know where to start.

Any kind soul able/willing to point me in the right direction? Is KM even the right use-case for this?

thanks

Lucas

I think you would have to show us a sample file here.

If you include the HTML

```

between triple backticks, above and below

```

it will display legiblly.

Thanks! That's what I was afraid of.

I can't give you a real sample, but the following is the general form with all the specific info removed and generalized. I'm quite sure every file is ordered in the same way.

<!DOCTYPE html>
<html>
<head><meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<title>Here it is</title><meta name="keywords" content="" />
<meta name="description" content="" />
<link href="CCS/default.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="ccs/style1.css" />
<script src="CCS/js/modernizr.custom.js"></script>
<script type="text/javascript">
function goTo() {
  var i = parseInt(document.forms[0].pageNumber.value) - 1;
  if (i<0) i = 0;
  if (i>=20) i = 19;
  if (i>=0 && i<20)
  {
    window.location = "ThisOne" + i + ".html";
  }
}
function resizeIframe(obj) {
  var the_height = window.innerHeight - 310;
  obj.style.height = the_height + 'px';
}
</script>
</head>
<body>
<div id="outer">
<div id="header">
<h1>Check out this interesting stuff</h1>
<h2>Courtesy of me!</h2>
</div>
<div id="menu">
<ul>
<li class="first"><a href="../index.html" title="">OK</a></li>
<li><a href="item1.html" title="">item1</a></li>
<li><a href="item2.html" title="">item2</a></li>
<li><a href="item3.html" title="">item3</a></li>
<li><a href="item4.html" title="">item4</a></li>
<li><a href="item5.html" title="">item5</a></li>
</ul>
</div>
<div id="content">
<div id="primaryContentContainer">
<style type="text/css">
.thumbnails img {
max-height: 300px;
max-width: 300px;
border: 4px solid #555;
padding: 1px;
margin: 50px 50px 50px 50px;
}
.thumbnails img:hover {
border: 4px solid #00ccff;
cursor:pointer;
}
</style>
<h4 align="center">
 <a href="ThisIsIt.html"" style="text-decoration:none">&lt;</a> <a href="ThisIsIt.html">1</a> <a href="Something_somethingelse_wow_1.html">2</a> <a href="Something_somethingelse_wow_2.html">3</a> <a href="Something_somethingelse_wow_3.html">4</a> <a href="Something_somethingelse_wow_4.html">5</a> <a href="Something_somethingelse_wow_5.html">6</a> <a href="Something_somethingelse_wow_6.html">7</a> <a href="Something_somethingelse_wow_7.html">8</a> <a href="ThisIsIt.html"">9</a> 10 <a href="Something_somethingelse_wow_10.html">11</a> <a href="Something_somethingelse_wow_11.html">12</a> <a href="Something_somethingelse_wow_12.html">13</a> <a href="Something_somethingelse_wow_13.html">14</a> <a href="Something_somethingelse_wow_14.html">15</a> <a href="Something_somethingelse_wow_15.html">16</a> <a href="Something_somethingelse_wow_16.html">17</a> <a href="Something_somethingelse_wow_17.html">18</a> <a href="Something_somethingelse_wow_18.html">19</a> <a href="Something_somethingelse_wow_19.html">20</a> <a href="Something_somethingelse_wow_10.html" style="text-decoration:none">&gt;</a><form method="post" action="javascript: goTo();">
<div>
<input id="pageNumber" type="text" value=""
onfocus="if(this.value == this.defaultValue) this.value = '';"
onblur="if(this.value == '') this.value = this.defaultValue;">
<input type="submit" value="Go">
</div>
</form>
</h4>
<table  id="content_table"  border = "1" frame="frame" rules="all" width="100%">
<col width="20%" />
<col width="0" />
<tr>
<td>Var1</td>
<td>Var1Content</td>
</tr>
<tr>
<td>Var2</td>
<td>Var2Content</td>
</tr>
<tr>
<td>Var3</td>
<td>Var3Content</td>
</tr>
<tr>
<td>Var4</td>
<td><LA>Var4Content</td>
</tr>
<tr>
<td>Var5</td>
<td>Var5Content</td>
</tr>
<tr>
<td>Var6</td>
<td>Var6Content</td>
</tr><tr>
<td>Var7</td>
<td>Var7Content</td>
</tr>
<tr>
<td>Var8</td>
<td>Var8Content</td>
</tr>
<tr>
<td>Var9</td>
<td>Var9Content</td></tr>
<tr>
<td>Var10</td>
<td>Var10Content</td>
</tr>
<tr>
<td>Var11</td>
<td>var11content</td>
</tr>
<tr>
<td>Var12</td>
<td>Var12Content</td>
</tr>
<tr>
<td>Var13</td>
<td>Var13Content</td>
</tr>
<tr>
<td>Var14</td>
<td>Var14Content</td>
</tr>
<tr>
<td>Var15</td>
<td>Var15content</td></tr><tr>
<td>Var16</td>
<td>Var16content</td>
</tr>
<tr>
<td>Var17</td>
<td>Var17Content</td>
</tr>
</table>
<h4 align="center">
 <a href="ThisIsIt.html"" style="text-decoration:none">&lt;</a> <a href="ThisIsIt.html">1</a> <a href="Something_somethingelse_wow_1.html">2</a> <a href="Something_somethingelse_wow_2.html">3</a> <a href="Something_somethingelse_wow_3.html">4</a> <a href="Something_somethingelse_wow_4.html">5</a> <a href="Something_somethingelse_wow_5.html">6</a> <a href="Something_somethingelse_wow_6.html">7</a> <a href="Something_somethingelse_wow_7.html">8</a> <a href="ThisIsIt.html"">9</a> 10 <a href="Something_somethingelse_wow_10.html">11</a> <a href="Something_somethingelse_wow_11.html">12</a> <a href="Something_somethingelse_wow_12.html">13</a> <a href="Something_somethingelse_wow_13.html">14</a> <a href="Something_somethingelse_wow_14.html">15</a> <a href="Something_somethingelse_wow_15.html">16</a> <a href="Something_somethingelse_wow_16.html">17</a> <a href="Something_somethingelse_wow_17.html">18</a> <a href="Something_somethingelse_wow_18.html">19</a> <a href="Something_somethingelse_wow_19.html">20</a> <a href="Something_somethingelse_wow_10.html" style="text-decoration:none">&gt;</a></h4>
</div>
<div class="clear"></div>
</div>
<div id="footer">
<br>
<center><h4>Date of Report<h4><h4>Thanks for reading!</h4></center>
</div>
</div>
</body>
</html>

That's rather rarer and less likely than you might guess : -)

My first thought would be to get someone to write you an aggregating XQuery over that set of files.

(Could be launched by Keyboard Maestro, of course, but a bit beyond the range of KM's built-in actions)

A simpler tool, which might at least get the HTML files into a more tractable TXT format, is textutil.

(See man textutil on a Terminal.app command line)

There's a nice command-line HTML parser called pup that you can install via Homebrew. It uses CSS-style selectors to pull information out of HTML files. For example, if the example you showed us was in a file called info001.html, then running this command in the Terminal,

pup 'table tbody tr :nth-child(2) text{}' < info001.html

would give this output,

Var1Content
Var2Content
Var3Content
Var4Content
Var5Content
Var6Content
Var7Content
Var8Content
Var9Content
Var10Content
var11content
Var12Content
Var13Content
Var14Content
Var15content
Var16content
Var17Content

which might be a good starting point for building your spreadsheet.

I can't give you any further advice, as I don't know how your HTML files are named or how the data sets they provide are distinguished from one another. I'm guessing you'd like to have them organized by date, but apart from the

<h4>Date of Report</h>

near the bottom of the file, there's nothing in the file that tells me the date.

Hi, I haven't returned to this post for awhile (as I wound up doing this task semi-manually) but thank you both very much for your thoughts—can't believe my post attracted the attention of a semi-celebrity. What a great community this is.

I have to deal with this issue semi-frequently, so it's very much worth my working to come up with a solution for future use-cases.

@drdrang, I installed pup and will start to poke around with it. If I get to the point I'm stuck I'll ask— but just in case you're curious, to answer your specific q's/comments:

  • the HTML files have unique names that themselves provide information not otherwise available in their contents: a category word, subcategory word, sub-sub category numeral, and sequential number, i.e. — Category_subcategory_#_##

Each of the html files contains various records from a database in reference to a single indexed file: a unique identifier that I presume is generated by the parent software, and records of file-table entries like C/A/M date-times, filepath, etc— in addition to various checksums and other metadata associated with each file.

There's enough unique information that I could pick any number of points by which to organize them— filename of the actual html file, the unique ID# provided, or for that matter any one of the checksums/hash values. It doesn't really matter which; the purpose of my hoped-for spreadsheet is to be able to organize and compare this info across a number of different data-points.

Sorry to be so obscuroso about it—hopefully the reason why can be inferred.

thanks again

Given that this is still, at least potentially, an issue -- do you need parse the HTML at all? A single table (no others nested inside it) will be easy to extract and process as text -- remove all the line-breaks, replace </td><td> with a tab, replace </tr> with a return, nuke all the other HTML tags (unless you want formatting preserved?), and either save out as tab-delimited files for import into Excel or paste in directly from the clipboard.