HTML

Manifold can read tables within HTML files.   This capability is often used to harvest tables from saved web pages.    

 

Manifold uses Microsoft facilities to connect to all Microsoft Office formats, including .html, .htm and other legacy Office formats such as .db, .mdb, .xls, and .wkx, together with newer Office formats such as .xlsx and .accdb.  If Manifold cannot import from such formats, the Windows system we are using is missing the necessary facilities.  Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.

Create an Example HTML File

In this example we visit a Wikipedia page giving a table with a list of Roman amphitheatres.   We would like to import that table into a Manifold project. To do that we will save the table in an HTML file.

 

eg_import_html01_04.png

 

In theory, we could simply tell our browser to save the page we are looking at as an HTML file.   The problem is that modern web pages contain a seeming infinity of junk, often including many tables that are not of interest.    It is easier to copy only the table we are interested in, paste it into some convenient editor, and then save it as an HTML file.
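To get a feel for how many tables a saved page really contains before importing it, we can count them outside of Manifold.  The following is a minimal sketch using only the Python standard library; it is not part of Manifold, and the names TableCounter and count_tables as well as the sample markup are hypothetical, chosen for illustration.  It reports one row count per table element, in document order.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> elements and the <tr> rows inside each one."""

    def __init__(self):
        super().__init__()
        self.row_counts = []     # one entry per <table>, in document order
        self._open_tables = []   # stack of indexes into row_counts

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.row_counts.append(0)
            self._open_tables.append(len(self.row_counts) - 1)
        elif tag == "tr" and self._open_tables:
            # Credit the row to the innermost table that is still open.
            self.row_counts[self._open_tables[-1]] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._open_tables:
            self._open_tables.pop()


def count_tables(html_text):
    """Return a list with the number of rows in each table found."""
    parser = TableCounter()
    parser.feed(html_text)
    parser.close()
    return parser.row_counts


# Hypothetical sample standing in for a saved web page:
# a one-row navigation table followed by the table of interest.
sample = """
<html><body>
<table><tr><td>menu</td></tr></table>
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Arles</td><td>France</td></tr>
  <tr><td>Pula</td><td>Croatia</td></tr>
</table>
</body></html>
"""

print(count_tables(sample))
```

A saved page that reports many small tables is a good candidate for the copy and paste approach, since each of those tables could otherwise show up in the import.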

 

We highlight the table of interest in the web page and we press Ctrl-C to copy the table to the Windows Clipboard.

 

eg_import_html01_05.png

 

We launch Microsoft Word with a new, blank document and we press Ctrl-V to paste the table from the clipboard.   Word tries its best to copy everything it can, including links and images from the copied Wikipedia table.

 

dlg_save_html_word.png

 

We save the document as a web page.   Manifold can import from either .htm or .html.  

 

Import from HTML

Launch Manifold and choose File - Import.

 

dlg_import_html.png

 

To import from HTML format:

 

  1. Choose File-Import from the main menu.

  2. In the Import dialog browse to the folder containing data of interest.

  3. Double-click the file ending in .htm or .html for the data of interest.

  4. One or more tables will be created.

 

 

eg_import_html01_01.png

 

If tables from the .htm or .html file are not created as shown above, that means the Windows system we are using is missing facilities necessary for a connection to HTML. Please see the Microsoft Office Formats - MDB, XLS and Friends topic for a solution.

 

 

We can double-click on tables that are created to view them.  

 

eg_import_html01_02.png

 

The table appears as imported.  Given the astonishing amount of junk encountered in tables in web pages in modern times, it is mildly surprising the table is as clean as it is.   The gray background shows that the table has no index and thus is neither selectable nor editable.  

 

We follow the procedure in the Add an Index to a Table topic to add an index to the table.

 

eg_import_html01_03.png

 

We can now edit the table as we like.   For example, we can right-click on the first cell of the first row and choose Edit to see the contents of that cell.   This is a fairly typical situation for a Wikipedia table, where numerous links are embedded in the table.   

 

We can get rid of those links by applying our expert editing skills using Regular Expressions and the Transform panel of the Contents pane:

 

eg_import_html01_06.png

 

For example, we can use the Replace Regexp, All transform template to clean up the City_(Roman name) field contents by using the regular expression #.*# as the Pattern to search for, with nothing in the Replace with box.    All text matching the pattern will be replaced with nothing, that is, every instance of text that matches the pattern will be deleted.

 

The regular expression #.*# says to match any sequence of characters that begins with a # character, followed by zero or more of any character, and ending with a # character.  That's exactly the link expression we want to eliminate from that field, to leave only the city names.
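The same pattern can be tried outside of Manifold before committing to an update.  The sketch below uses Python's re module, which is an assumption made for illustration: it is not Manifold's regular expression engine, and the cell values and the names strip_links and strip_links_lazy are hypothetical stand-ins for the imported field contents.  It also shows a caveat worth knowing: .* is greedy in Python, so a cell containing more than one link loses everything from the first # to the last #.

```python
import re

# Pattern from the text: a '#', then zero or more of any character, then a '#'.
LINK_PATTERN = r"#.*#"


def strip_links(value):
    """Delete every match of the pattern, as replacing with nothing does."""
    return re.sub(LINK_PATTERN, "", value)


def strip_links_lazy(value):
    """Variant with a non-greedy quantifier, so each link is removed separately."""
    return re.sub(r"#.*?#", "", value)


# Hypothetical cell values; real imported cells will differ.
cells = [
    "Arles#https://example.org/wiki/Arles#",
    "Pula#https://example.org/wiki/Pula#",
    "Nimes",
]
cleaned = [strip_links(cell) for cell in cells]
print(cleaned)

# A cell with two links shows the difference between greedy and lazy matching.
two_links = "Arles#a# (Arelate#b#)"
print(strip_links(two_links))
print(strip_links_lazy(two_links))
```

If the preview in the Transform panel removes more text than expected, a cell with multiple links is a likely cause, and a non-greedy form of the pattern may be worth testing, provided the engine in use supports it.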

 

eg_import_html01_07.png

 

The Transform panel helpfully shows a preview in blue preview color of what will happen when we press the Update Field button.  

 

Notes

Plenty to do - Most tables we harvest from web pages will require significant tinkering.  We will adjust the field names to more sensible names, we will use many different editing techniques, and we may find ourselves copying between fields to clean up messy imports.   The more expertise we develop with tools like the Transform panel, transform templates, regular expressions, the Select panel and similar, the less effort we will expend and the quicker our work will go.

 

See Also

Tables

 

Add an Index to a Table

 

Regular Expressions

 

Contents - Transform

 

Contents Pane

 

File - Create - New Data Source

 

DBMS Data Sources - Notes

 

Example: Closing without Saving - An example that shows how File - Close without saving the project can affect local tables and components differently from those saved already into a data source, such as an .mdb file database.

 

Example: Create and Use New Data Source using an MDB Database - This example illustrates the step-by-step creation of a new data source using an .mdb file database, followed by use of SQL.  Although now deprecated in favor of the more current Access Database Engine formats, .mdb files are ubiquitous in the Microsoft world, one of the more popular file formats in which file databases are encountered.

 

Example: Switching between Manifold and Native Query Engines - How to use the !manifold and !native commands to switch a query in the Command window from using the Manifold query engine to using whatever query engine is provided by a data source.

 

Microsoft Office Formats - MDB, XLS and Friends