I have hundreds of html files, all containing the same content table—i.e., the same table_id, the same individual variables in the same order, but with different information in each file.
These aren't from a webpage; they're exports from another program. I don't have the program, only the exported HTML.
I would very much like to get them all into a single spreadsheet, in whatever format, and I have no knowledge of JS or regex.
I just spent an hour poking around on the forum and elsewhere trying to figure out how to do this almost certainly simple task, but there's no way I'm going to brute-force it, and I don't know where to start.
Any kind soul able/willing to point me in the right direction? Is KM even the right use-case for this?
There's a nice command-line HTML parser called pup that you can install via Homebrew. It uses CSS-style selectors to pull information out of HTML files. For example, if the example you showed us was in a file called info001.html, you could run pup on it in the Terminal with a selector targeting that table, which might be a good starting point for building your spreadsheet.
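If you'd rather not install anything, the same sort of extraction can be sketched with Python's standard-library HTML parser. This is just a sketch: the sample markup and the `info_table` id below are hypothetical stand-ins for one exported table, not taken from the actual files.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects the text of each <td> cell, grouped into rows by <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._in_td = False   # are we inside a <td>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

# Hypothetical markup standing in for one exported file's table:
sample = ("<table id='info_table'>"
          "<tr><td>Name</td><td>report.pdf</td></tr>"
          "<tr><td>MD5</td><td>abc123</td></tr>"
          "</table>")
parser = TableExtractor()
parser.feed(sample)
print(parser.rows)  # → [['Name', 'report.pdf'], ['MD5', 'abc123']]
```

Each row then maps directly onto one spreadsheet row (or, since every file has the same table, onto one column per file).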
I can't give you any further advice, as I don't know how your HTML files are named or how the data sets they provide are distinguished from one another. I'm guessing you'd like to have them organized by date, but apart from the
<h4>Date of Report</h4>
near the bottom of the file, there's nothing in the file that tells me the date.
Hi, I haven't returned to this post for a while (as I wound up doing this task semi-manually), but thank you both very much for your thoughts. I can't believe my post attracted the attention of a semi-celebrity. What a great community this is.
I have to deal with this issue semi-frequently, so it's very much worth my working to come up with a solution for future use-cases.
@drdrang, I installed pup and will start to poke around with it. If I get to the point where I'm stuck I'll ask, but just in case you're curious, here are answers to your specific questions/comments:
the HTML files have unique names that themselves provide information not otherwise available in their contents: a category word, a subcategory word, a sub-subcategory numeral, and a sequential number, e.g. Category_subcategory_#_##
Each of the HTML files contains various records from a database in reference to a single indexed file: a unique identifier that I presume is generated by the parent software, and records of file-table entries like created/accessed/modified (C/A/M) date-times, filepath, etc., in addition to various checksums and other metadata associated with each file.
There's enough unique information that I could pick any number of points by which to organize them— filename of the actual html file, the unique ID# provided, or for that matter any one of the checksums/hash values. It doesn't really matter which; the purpose of my hoped-for spreadsheet is to be able to organize and compare this info across a number of different data-points.
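Since the filenames carry metadata that isn't in the files themselves, they're easy to split into extra spreadsheet columns. A sketch, assuming underscores never appear inside the category words; the example name below is hypothetical, just following the Category_subcategory_#_## pattern:

```python
# Hypothetical filename following the Category_subcategory_#_## scheme
name = "Evidence_photos_3_07.html"
category, subcategory, subsub, sequence = name.removesuffix(".html").split("_")
print(category, subcategory, subsub, sequence)  # → Evidence photos 3 07
```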
Sorry to be so obscuroso about it—hopefully the reason why can be inferred.
Given that this is still, at least potentially, an issue -- do you need to parse the HTML at all? A single table (no others nested inside it) is easy to extract and process as text: remove all the line-breaks, replace </td><td> with a tab, replace </tr> with a return, nuke all the other HTML tags (unless you want formatting preserved?), and then either save out as tab-delimited files for import into Excel or paste in directly from the clipboard.
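Those steps can be sketched as a small script. This is only a sketch under the stated assumption of simple, non-nested tables; the `*.html` glob and the `combined.tsv` output name are placeholders, and the regexes allow for attributes and whitespace around the tags:

```python
import glob
import re

def table_to_rows(html):
    """Turn simple table markup into tab-separated rows, following the
    replace-and-strip steps described above (no nested tables assumed)."""
    text = html.replace("\n", "")                     # remove line breaks
    text = re.sub(r"</td>\s*<td[^>]*>", "\t", text)   # cell boundary -> tab
    text = re.sub(r"</tr>", "\n", text)               # row boundary -> return
    text = re.sub(r"<[^>]+>", "", text)               # nuke all other tags
    return [line for line in text.split("\n") if line.strip()]

# Hypothetical glob; the real exports would be Category_subcategory_#_##.html
rows = []
for path in sorted(glob.glob("*.html")):
    with open(path) as f:
        rows.extend(table_to_rows(f.read()))

# Save out as one tab-delimited file for import into Excel
with open("combined.tsv", "w") as out:
    out.write("\n".join(rows))
```

Since every file has the same table in the same order, appending the rows file by file like this keeps matching variables lined up for comparison.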