Manifold can read tables within HTML files.   This capability is often used to harvest tables from saved web pages.    


Manifold uses Microsoft facilities to connect to all Microsoft Office formats, including .html, .htm and other legacy Office formats such as .db, .mdb, .xls, and .wkx, together with newer Office formats such as .xlsx and .accdb.  If Manifold cannot import from such formats, that means the Windows system we are using is missing the necessary facilities.  Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.

Create an Example HTML File

In this example we visit a Wikipedia page giving a table with a list of Roman amphitheatres.   We would like to import that table into a Manifold project. To do that we will save the table in an HTML file.




In theory, we could simply tell our browser to save the page we are looking at as an HTML file.   The problem with that is that modern web pages contain a seeming infinity of junk, often including many tables that are not of interest.    It is easier to simply copy the table we are interested in, paste it into some convenient editor, and then save as an HTML file.   
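To get a feel for how many tables a saved page really contains, we can count them before importing.  The following is a minimal sketch using only Python's standard html.parser module; it is not part of Manifold, and the names TableCounter and count_tables are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Record one row count per top-level <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # current nesting depth of <table> elements
        self.row_counts = []    # one entry per top-level table

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.row_counts.append(0)
        elif tag == "tr" and self.depth >= 1:
            # Rows of nested tables are counted toward the enclosing
            # top-level table.
            self.row_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_tables(html_text):
    """Return a list with the number of rows in each top-level table."""
    parser = TableCounter()
    parser.feed(html_text)
    parser.close()
    return parser.row_counts
```

Calling count_tables on the text of a saved page returns one number per top-level table, so a long list is a hint that copying only the table of interest will be less work than importing the whole page.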


We highlight the table of interest in the web page and we press Ctrl-C to copy the table to the Windows Clipboard.




We launch Microsoft Word with a new, blank document and press Ctrl-V to paste the table from the Clipboard.   Word tries its best to copy everything it can, including links and images from the copied Wikipedia table.




We save the document as a web page.   Manifold can import from either .htm or .html.  
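If no browser or word processor is handy, a small test file can also be generated by a script.  The sketch below, using only Python's standard library, builds a minimal HTML document containing a single table.  The function names and the placeholder values in the usage note are assumptions made for illustration, and whether any given file imports cleanly still depends on the Microsoft facilities installed on the Windows system.

```python
from html import escape


def table_to_html(header, rows, title="Example table"):
    """Return a minimal HTML document containing a single table."""
    lines = [
        "<html>",
        "<head><title>%s</title></head>" % escape(title),
        "<body>",
        "<table>",
    ]
    # Header row: one <th> cell per field name.
    lines.append(
        "<tr>" + "".join("<th>%s</th>" % escape(str(h)) for h in header) + "</tr>"
    )
    # Data rows: one <td> cell per value, escaped so that characters
    # such as & and < cannot break the markup.
    for row in rows:
        lines.append(
            "<tr>" + "".join("<td>%s</td>" % escape(str(c)) for c in row) + "</tr>"
        )
    lines += ["</table>", "</body>", "</html>"]
    return "\n".join(lines)


def save_table(path, header, rows):
    """Write the generated document to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(table_to_html(header, rows))
```

For example, save_table("example.html", ["City", "Country"], [["Placeholder A", "Placeholder X"]]) writes a one-row table with placeholder values that can then be used to try the import steps.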


Import from HTML

Launch Manifold and choose File - Import.




To import from HTML format:


  1. Choose File - Import from the main menu.

  2. In the Import dialog browse to the folder containing data of interest.

  3. Double-click the file ending in .htm or .html for the data of interest.

  4. One or more tables will be created.





If tables from the .htm are not created as shown above, that means the Windows system we are using is missing facilities necessary for a connection to HTML. Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.



We can double-click on tables that are created to view them.  




The table appears as imported.  Given the astonishing amount of junk encountered in tables in web pages in modern times, it is mildly surprising the table is as clean as it is.   The gray background shows that the table has no index and thus is neither selectable nor editable.  


We follow the procedure in the Add an Index to a Table topic to add an index to the table.




We can now edit the table as we like.   For example, we can right-click on the first cell of the first row and choose Edit to see the contents of that cell.   This is a fairly typical situation for a Wikipedia table, where numerous links are embedded in the table.   


We can get rid of those links by applying our expert editing skills using Regular Expressions and the Transform panel of the Contents pane:




For example, we can use the Replace Regexp, All transform template to clean up the City_(Roman name) field contents by using the regular expression #.*#  as the Pattern to search for, with nothing in the Replace with box.    All text matching the pattern will be replaced with nothing, which deletes every instance of text that matches the pattern.


The regular expression #.*# says to match any sequence of characters that begins with a # character, followed by zero or more of any character, and ending with a # character.  That's exactly the link expression we want to eliminate from that field, to leave only the city names.
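The effect of the pattern can be tried outside Manifold as well.  The following is a minimal sketch using Python's re module, not Manifold's own regular expression engine, so details may differ.  The sample cell text and the helper names strip_links and strip_links_lazy are hypothetical, chosen only to mimic link text enclosed in # characters.

```python
import re

# The pattern from the text: a #, then any characters, then a #.
PATTERN = r"#.*#"


def strip_links(text):
    """Delete everything from the first # through the last # on a line."""
    return re.sub(PATTERN, "", text)


def strip_links_lazy(text):
    """Variant with a lazy quantifier: delete each #...# pair separately."""
    return re.sub(r"#.*?#", "", text)


# Hypothetical cell content: a link wrapped in # characters, then the name.
print(strip_links("#https://example.org/wiki/Rome#Rome"))   # prints "Rome"

# With two links in one cell, the greedy .* also removes the text
# between them, while the lazy variant keeps it.
print(strip_links("Rome#a#, Italy#b# region"))       # prints "Rome region"
print(strip_links_lazy("Rome#a#, Italy#b# region"))  # prints "Rome, Italy region"
```

In this sketch the greedy form is enough when each cell holds a single link expression, as in the example above.  If a field could contain several links with wanted text between them, the lazy form is the safer choice, and the preview in the Transform panel is a convenient place to check the result before updating the field.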




The Transform panel helpfully shows a preview, in blue preview color, of what will happen when we press the Update Field button.  



Plenty to do - Most tables we harvest from web pages will require significant tinkering.  We will adjust the field names to more sensible names, we will use many different editing techniques, and we may find ourselves copying between fields to clean up messy imports.   The more expertise we develop with tools like the Transform panel, transform templates, regular expressions, the Select panel and similar, the less effort we will expend and the quicker our work will go.


