CSV, ASC, PSV, TAB

Manifold can read Comma Separated Values (CSV)  and similar formats that represent tables within text files, where each line is a record and the values for fields are separated by a character such as a comma (CSV). Each CSV file contains one table.   A similar format uses a | "pipe" character and is often knonw as Pipe Separated Values, or PSV file format.    

 

We can bring data from CSV files into Manifold in two ways:

 

 

 

The first option above, creating a new data source, provides more options when harvesting data from CSV files.   We therefore will focus on connecting to CSV files using the File - Create - New Data Source command.  If we prefer to import the data into our project we can copy the table from the CSV data source and paste it into the project.

 

Connecting to a CSV file:

 

  1. Choose File - Create - New Data Source in the main menu, or right-click in the Project pane and choose New Data Source.

  2. Choose File: csv in the Type box.

  3. Click the [...] browse button to navigate to the desired .csv file and to open it.

  4. Default settings will usually work for most CSV files.   Press Create Data Source.

  5. A new data source appears in the project.  Open it to see the table that is linked in from the CSV file.

 

Creating a local copy of the CSV table:

 

  1. Click on the table linked in from the CSV to highlight it.

  2. Press Ctrl-C or click the Copy button in the Project pane toolbar.

  3. Click anywhere in the main part of the project outside of the new data source's hierarchy.

  4. Press Ctrl-V or click the Paste button in the Project pane toolbar.

 

Tables that are linked from CSV files are neither selectable nor editable since they do not have any indexes.  We cannot add an index to a linked CSV table since CSV format does not support indexes.  

 

Local tables that are imported from CSV or copied and pasted from a linked CSV table are fully capable, fully DBMS Manifold tables.   We can easily add an index to them to enable editing and selection, using the procedure given in the Add an Index to a Table topic.

 

Adding an index to a table:

 

  1. Double-click on the table to open it.

  2. Choose Edit - Schema.

  3. In the Schema list, click the <new field> item.

  4. In the Field box specify mfd_id as the name of the field.

  5. In the Type box choose int64.

  6. Press the Add button.  A new field appears in the Schema list.

  7. In the Schema list, click the <new index> item.

  8. In the Index box specify mfd_id_x as the name of the index.

  9. Press the Add button.  A new index appears in the Schema list.

  10. In the Schema list under the new mfd_id_x index, click the <new field> item.

  11. In the Field box choose mfd_id from the pull-down list.   Press Add.

  12. The mfd_id field appears as a new key field under the index.

  13. Press OK.

 

 

dlg_formats_csv01_01.png

 

Name

Name for the new data source, "Data Source" by default.  Specify a more  memorable name as desired.  If we forget the origin of a data source we can hover the mouse over the data source and a tool tip will provide connection information.

Type

To connect to a CSV file, choose File: csv from the very long list of data sources in the pull down menu.

Source

A connection string or path to the CSV file.  The source string can be manually entered or it can be created using a Windows dialog launched by the [...] browse button.

[...] Browse button

Click to use a standard Windows Select File dialog (very similar to the usual File Open dialog) to choose the desired CSV file.

Open as read-only

CSV files in general are not writable even if the file itself on disk is not specified read-only by Windows. All the same, Manifold creates a read/write mfd_meta file that can be used to store characteristics like the width of columns and other properties within the project file.  Checking the Open as read-only box forces the entire data source to be read-only.

First line contains field names

Checked by default.   The first line in most CSV files gives the names of fields.  Some CSV files do not.

Read all fields as text

The system will try to guess what fields contain text and what fields contain integers or floating point numbers.  In some cases, such as Zip codes in US postal addresses, what appears to be a number should be treated as text.  Checking this box imports all fields as text even if they are not marked as text and even if they appear to be numeric fields.

List delimiter

The character used within the CSV file to separate fields within a record. The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of standard characters: comma, colon, semicolon, pipe, space or tab.

Decimal separator

The character used within the CSV file to denote the decimal fractional part of a number, almost always either a period or a comma.    The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of two standard characters: period or comma.  A period is used in North America and many former UK colonies.  A comma is used in Europe, South America and many African countries.

Text qualifier

The character used within the CSV file to wrap values intended to be text.  The default (auto) setting usually guesses correctly.   Enter any character or choose it from the list of two standard characters:  a double " quote character or a single ' quote character.

Create Data Source

Create the new data source in the project pane and close the dialog.

Edit Query

Launch the Command Window loaded with a query that creates the data source using the given settings.  A great way to learn how to use SQL to create data sources.

Cancel

 Exit the dialog without doing anything.

 

dlg_formats_csv01_02.png

 

The Select File dialog is a variation of the well-known Windows File Open dialog.   By default, it shows files ending in the standard .asc, .csv, .psv, and .tab file extensions used with CSV files.   Click on the file to load it into the File name box.   If we like, we can write the name of any file manually into the File name box.   For example, if we have a CSV file that uses a .txt extension, we can put the name employees.txt into the File name box and that will be read as well.

About CSV Format

"Comma-separated value" format is a text file format used to convey information in tables.   Each line in the text file contains data for a single record.   Values for each field are separated by a special character, for example, a comma.   The specific character that is used to separate field values is called the list delimiter character.   Usually a comma is used (hence the name "comma" separated value files...), but other characters can be used as list delimiters, especially if the data in the file most convey text that contains commas.

 

Much of how CSV files are used is a matter of tradition that has become accepted over the centuries since the original introduction of CSV format, about the time of the domestication of the ox, and the introduction of IBM FORTRAN in 1972.    Over the years, various traditions and attempts at standardization have emerged, often trying to deal with how to represent commas in the data when the comma character itself is used as part of the format.

 

The idea behind CSV is simple.  Consider a file called employees.csv which contains the following text:

 

Last Name, Country, Height

Fuller, USA, 172

Suyama, UK, 165

Smith, USA, 185

Dodsworth, UK, 187

Martin, France, 180

Pereira, France, 169

 

By tradition, the first line in a CSV file gives the names of the fields in the table, with the name of each field being separated by a comma , character.   Based on the names of the fields and the contents of the file, we might guess that the intended table is a list of employees, giving the last name of each, their country, and the height of the employee in centimeters.   

 

il_formats_csv01_00.png

 

Manifold will read our example employees.csv text file and will create a table from it that looks like the above, with the Last Name and Country fields being text data types and the Height field being an integer data type.  

Telling Text from Numbers

In the example file above it is obvious that the values for Height are intended to be numbers.   When all records for a specific field contain only the numeric characters 0, 1, 2, ..., 9 it is usually a fair guess that field is intended to be a number.  But sometimes that is not a correct guess.  For example, Zip codes for US postal addresses cannot be treated as numbers given that  leading zeros in Zip codes are significant.   If our Zip code is 02138, a Zip code in Cambridge, Massachusetts, the leading zero must be preserved.

 

To avoid the need to guess,  CSV files use long-accepted traditions for manually specifying what values are supposed to be text.  A convention almost universally respected in programs that create or consume CSV files is that any content wrapped with quote characters is to be treated as text.   We could write the employees.csv file as:

 

"Last Name","Country","Height"

"Fuller","USA",172

"Suyama","UK",165

"Smith","USA",185

"Dodsworth","UK",187

"Martin","France",180

"Pereira","France",169

 

Since of course it would be far too easy for everyone to agree on such a simple rule, some people use double quote " characters to wrap a value intended as text and other people using a single quote ' character to wrap text.  Manifold can handle either or any other character that is used.

 

Checking the Read all fields as text option tells Manifold to interpret all values as text data types, to force exclusively number content, like Zip codes, to be read as text data whether it is wrapped with quote characters or not.

Dealing with Commas

It seems obvious that whoever first decided to use commas as a list delimiter to separate field values in CSV files did not expect to use CSV files to store text values such as street addresses, since constructions such as "123 Main Street, Smallville, Ohio" obviously use commas, nor did they expect to store text that includes sentences as often appear in descriptions and comments.    That is such a painful restriction in CSV format that there are two ways to get around the restriction.

 

Commas within quoted text do not count as list delimiters.

A different character, such as a pipe | character, is used as a list delimiter.

 

The first method is that commas used within quoted text do not count as list delimiters.   For example, if we wrote the employees.csv file as:

 

"Name","Country","Height"

"Fuller, Andrew","USA",172

"Suyama, Michael","UK",165

"Smith, Tim","USA",185

"Dodsworth, Anne","UK",187

"Martin, Xavier","France",180

"Pereira, Laurent","France",169

 

Then the resulting table would be

il_formats_csv01_05.png

 

The above is obviously prone to error and confusing.    The second method is to avoid using a common character, like a comma, that is often found in text and instead to use a rare character such as a pipe | symbol.     For example, if we wrote the employees.csv file as:

 

Name|Country|Height

Fuller, Andrew|USA|172

Suyama, Michael|UK|165

Smith, Tim|USA|185

Dodsworth, Anne|UK|187

Martin, Xavier|France|180

Pereira, Laurent|France|169

 

Then the resulting table, whether or not we chose to continue using quotes to set off text, still would be

il_formats_csv01_05.png

 

Using commas is such a wretched choice when saving text info in tables and the pipe character tends to lend such clarity that using the pipe character is becoming more popular.   In fact, it is so popular that Pipe separated values (PSV) files have acquired a name and a three letter extension all their own.  Super!

Commas and Periods in Numbers

If we have spent our entire lives in the US we may be surprised to learn that in many other countries people do not write pi as 3.141 but instead write it as 3,141 and by that they do not mean three thousand one hundred forty one but instead mean the same as 3.141 means in the US.   That is, they use the comma ,  character to denote the fractional part of a decimal fraction.   Likewise, someone in France might be surprised to learn that in the US they do not write pi as 3,141.  

 

Cultures in some countries, primarily in Europe, South America and Africa, utilize a comma , character to indicate a decimal point.   In those cultures, the value of pi, a magic number relating the circumference of a circle to the radius of the circle, is written approximately as 3,141.  Other countries, primarily in North America, and the UK and former British colonies, use the period . character, to write pi as approximately 3.141.    To confuse the matter further,  either the comma or the period may be used as a "thousands" separator to group digits.      

 

A CSV file that contains text such as 3,141 depending on the culture that wrote the text could be intended to mean a numeric value about three and a tenth, or a numeric value of three thousand one hundred and forty one.    Likewise, text such as 2.718 in some cultures could either mean the value of e (another magic number first named as "e" by Leonhard Euler in the early 1700s), that is, a fractional number more than two but less than three, or in other cultures 2.718 could mean two thousand seven hundred eighteen.

 

Given the popularization of CSV by American companies such as IBM, the use of a comma character to separate field values made European-style use of the comma character as a decimal separator infeasible in CSV.      The use of a comma character to separate field values was also a painful restriction, in that it prevented a uniform way of emerging to use commas in field names, such as Population, 1990 or in text values such as complete addresses, as in 123 Main Street, Richmond, Virginia.   

 

Almost immediately, therefore, Europeans and other hip people created variations on CSV format that used other characters to separate field values.  Popular choices over the years have been the space character (obviously a limiting choice given the tendency of text to use space characters), the tab character, a semicolon ; character, a colon : character and, most recently, the pipe | character.    Using pipes to separate field values has become so popular that such files even have a name of their own: pipe-separated value  files, using the .psv three-letter extension.

 

Manifold will auto-recognize the above popular choices in most circumstances, or the desired separator can be specified.

Other Characters used as List Delimiters

We have touched on using the pipe character as an alternative to commas.   Other characters are also popular.   Older CSV files will often use characters that are not commas but which are also frequently found in ordinary text, such as colons, semicolons, space characters or tab characters.     Sometimes CSV files are created using manual editing where HTML tables are converted into CSV format, using tricks like replacing <td> tags with a list delimiter.   In such cases users will sometimes use a character, like a caret ^ character, that does not occur within the data.  Manifold can handle such characters as well.

 

The following table will be created using of the various characters shown in the following Notepad windows in use as list delimiters:

 

il_formats_csv01_00.png

 

Comma , character:

 

il_formats_csv01_01.png

 

Pipe | character:

 

il_formats_csv01_02.png

 

Caret ^ character:

 

il_formats_csv01_03.png

 

Tilde ~ character:

 

il_formats_csv01_04.png

 

We can enter whatever character we want into the List delimiter combo box in the New Data Source dialog.

 

dlg_formats_csv01_03.png

 

For example, in the illustration above we have entered the caret ^ character as the List delimiter.

Automatic Adjustments

In most cases, the (auto) setting will work fine given that Manifold is reasonably good at auto-detecting how a given CSV file should be interpreted.  By manually specifying options we can guide the system in better interpreting a file.   If options we specify conflict with each other, the Manifold CSV dataport ignores the value for the less important option.   For example if the List delimiter is set to (auto) and when Manifold scans the file the system discovers that the list separator is a comma, then the dataport will not allow the Decimal separator character to also be a comma.

 

Exports

Manifold exports CSV files always using comma characters as list delimiters, periods as decimal separators and double quotes as text qualifiers.   Exporting boolean true or false values to CSV puts them within quotes.  When text values contain line breaks, exporting text values to a CSV file replaces line breaks with spaces to make sure exported data can be imported by as many products as possible.

 

See Also

Tables

 

Data Types

 

File - Create - New Data Source

 

Example: Import CSV and Create a Drawing - Import a CSV file by creating a New Data Source, copying and pasting the table conveyed by the CSV, prepping the table for use as a geocoded table and then creating a drawing.