Extracting Data from Excel

I have always had a special respect for Excel.  I have thought of it as the one of the best pieces of software out there (along with Unix and the TCP/IP stack of the original BSD).  Sometimes, however, it can be a serious pain in the neck, especially when you are trying to work with it programmatically.

Microsoft Excel logoI am not the only person who thinks highly of Excel as an application, Information Week called it the great implementation of the spreadsheet. Working with it programatically is a different story, however.

Recently, I had the task of importing information from an Excel file, something that, on paper, sounds like money in the bank.  However, once I started coding the import jobs, the horror was right around the corner.

In this post I review the path I followed, so you don’t have to suffer through it as I did.

The Easy Way (or so I Thought)

First, I tried the easy way, using ADO.NET.  I thought of reading the file using the Excel object model (i.e., using the Excel application objects), but that was too much work for something that I thought could be achieved in a simpler and elegant way.  So, I used a connection string in this form,

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:MyExcel.xls;Extended Properties=”Excel 8.0;HDR=Yes;”;

The first complication was that you have to use the ACE engine to read XLSX files, so I had to change the connection string to this,

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:MyExcel.xls;Extended Properties=”Excel 12.0;HDR=Yes;”;

Once I got the right connection string, I started to think about how to get the data.  This file had no header row to help with the parsing, so I came up with this (in the select query while filling an OleDBDataAdapter):

String query = String.Format(“SELECT * FROM [sheetName$A10:AA]”);  OleDbDataAdapter excelAdapter = new OleDbDataAdapter(query, connectionStringBuilder.ToString());

That way I can get all the cells in the A10 row, up to the AA column.

Then, I wanted to query those results in with LINQ, and there is an easy way to do this by filling a DataSet through the OleDBAdapter,

DataSet excelData = new DataSet();  excelAdapter.Fill(excelData, “sheetName”);  return excelData.Tables[“sheetName”];

This returns a DataTable which you can query like this:

foreach(DataRow row in excelDataTable) {
    row.Field(“columnName”);

The ACE Engine and the Data Type of a Column

Now, there are a few things to point out here:

  1. When you connect to the ACE engine and you specify HDR=YES, you are telling the engine that the first row in you range is the “Header Row”
    • If you have a header row, then all the fields will get that name and can be searched with it.
    • If you do not have a header row or you put HDR=NO, then all the fields are taken as information, an the names of the field are like “F1”, “F2”, etc.
  2. When calling the Field method, you have to specify a type.

HACK #1:  If you do not know the type of your columns (or simply do not want to cast everything to a particular type), use Field and the ToString method, it will return the type of the Field in plain text.

Also, there are two interesting things related to the way the ACE engine infers the data type.

First, the ACE engine, by default, looks at the first eight rows to determine the column type. It actually consults a Windows registry entry {Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/TypeGuessRows} to determine how many rows it should read.

The engine looks at the different types in a column and the one that appears most often, wins. This is crazy but, hey, that’s the way it works.

HACK #2: You can change this registry key and set it to a number between 1 and 16 to indicate the number of rows that you want it to scan.  If you set it to 0, it will force the engine to scan 16,384 rows.

Second, the IMEX option indicates how “intermixed” records should be interpreted.  If for some reason the engine cannot determine the column type, the IMEX indicates how to do it: set it to “1” if you want it to get it as text or to “0” to leave the default behavior.  This can also be set in a Windows Registry key {Hkey_Local_Machine/Software/Microsoft/Jet/4.0/Engines/Excel/ImportMixedTypes}.

One Final Tip

Once you establish these things in your head, the reading/writing of Excel files through ADO.NET should be like money in the bank.  Kind of.

Here’s the final tip: if you’re getting an error indicating that “the table is in a wrong format” or “could not decrypt the file,” please check that the file is not password protected.

Focus Mode

Contact Request

Close

We will call you right away. All information is kept private