Clues to Excel calculation performance


For any practical size of spreadsheet, the organization of calculations across sheets in a workbook may have a small impact on performance the first time a calculation is executed.  To gain some understanding of the likely impact it is useful to review how Excel stores workbook data, because some of the structures used either have a direct impact on performance or reflect techniques the developers implemented to improve Excel's read, write and calculation performance.  The documents detailing these structures are available from the Microsoft web site: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx.

In summary, assuming that a consistent number of calculations is performed, then if the calculations require access to data, or to dependent calculations, on sheets other than the selected sheet there may be a slight impact the first time the calculation chain is executed.  After that there will be no reproducible difference.  Excel also builds, maintains and uses a calculation graph.  You can see this graph most easily by unzipping an Excel 2007 file and reviewing the calcChain.xml file in the /xl/ folder.  Once the graph has been built, Excel maintains it dynamically and uses it to optimize calculation performance.  If the calculation graph becomes damaged, Excel can be instructed to rebuild it using the key combination CTRL+SHIFT+ALT+F9.
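The same rebuild can also be triggered programmatically.  As a minimal sketch in VBA (Application.CalculateFullRebuild is the documented method equivalent to this key combination, available since Excel 2002):

    Sub RebuildCalculationGraph()
        ' Equivalent to CTRL+SHIFT+ALT+F9: forces Excel to rebuild the
        ' dependency graph and then recalculate every formula in every
        ' open workbook.
        Application.CalculateFullRebuild
    End Sub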

A workbook document, whether in the old (97-2003) or new (2007) format, contains a number of logical parts or streams.  The main part is the workbook stream itself.  This can contain many types of information, but the key data are the styles, summary details of the worksheets, pivot table cache summary information, name definitions, and the shared string table (SST).  The SST is important.  Rather than scatter strings across worksheets, Excel gathers them into a single structure, the SST, and replaces the string in a cell or formula with a reference to an entry in the string table, allowing Excel to replace duplicate strings with a simple numerical reference.  There is a consequence to this efficiency trick: if a workbook contains sheets composed mainly of strings, Excel will have to read *all* the strings in *before* it does anything else.  In all cases the entire SST has to be read before any worksheets can be rendered.
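You can see the SST mechanism directly in the OOXML format.  A simplified sketch (the element structure is the real one; the cell reference and string value are invented for illustration):

    <!-- xl/worksheets/sheet1.xml: t="s" marks a shared string; <v> holds the SST index -->
    <c r="A1" t="s"><v>0</v></c>

    <!-- xl/sharedStrings.xml: the string itself is stored once, in the SST -->
    <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="1" uniqueCount="1">
      <si><t>Revenue</t></si>
    </sst>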

The workbook part contains a list of all the worksheets.  In the Excel 97-2003 format these are called “bound sheets”.  This boundsheet information tells Excel the name of each sheet, where in the file the data for that worksheet begins and how long it is.  This means that if Excel wants to extract the information for just the selected sheet, it can.  In the OOXML format the data for each sheet is contained in a separate “part”, or file, within the zip file.  So you can see that if all the calculations are on the selected sheet there is an advantage, because Excel can present the results of a calculation without reading the data for all worksheets.  You can see this behaviour when opening large workbooks in Excel 2007.  The selected sheet is usually displayed quickly, but the progress bar will show Excel’s progress as it reads data and prepares the display for all the other sheets.  However, if any of the cells on the selected sheet are volatile or semi-volatile (and so need to be recalculated), and if those recalculations have dependencies on cells in other worksheets, then Excel will need to read the data for those worksheets (and perform any calculations) before it can complete the display of the selected sheet.
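As a side note, VBA user-defined functions can opt into this volatile behaviour.  A minimal sketch (the function name is invented for illustration; Application.Volatile is the real call):

    Function MyTimestamp() As Date
        ' Application.Volatile marks this UDF as volatile, so Excel will
        ' recalculate it on every recalculation pass, just like NOW() or RAND().
        Application.Volatile True
        MyTimestamp = Now
    End Function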

However, the benefit of having calculations and data on one worksheet is transient.  Excel will ultimately read the data for all worksheets, and once the data is in memory the advantage disappears, because the calculation of a formula at one memory location is no more or less efficient than the calculation of a formula at another.

There are further optimizations in the file structure, especially in the binary formats (Excel 97-2003 only had a binary format, but Excel 2007 also supports one: the .xlsb format).  Each different type of information in a worksheet – cell values, for example – is held in a record.  These records begin with a unique “type” code followed by the length of the record.  Text cells have one type code, numbers another, a formula generating a number another, a formula generating a text value another, and so on.  So Excel does not need to read and parse all cell data to find out whether there are formulas in a sheet.  It can scan the worksheet data very efficiently, looking at each header for records that carry one of the 7 formula record type codes and using the record length information to move along to the next record.  The new XML format cannot be as efficient because XML is not indexed.
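To make the idea concrete, here is a rough VBA sketch of such a record scan.  It assumes the raw Workbook stream has already been extracted to its own file (a .xls file is a compound document, so the stream does not begin at byte 0 of the file), and for simplicity it checks only one of the formula type codes, &H6, the main FORMULA record in BIFF5/BIFF8:

    Sub CountFormulaRecords(streamPath As String)
        Dim f As Integer, recType As Integer, recLen As Integer
        Dim nFormulas As Long
        f = FreeFile
        Open streamPath For Binary Access Read As #f
        Do While Seek(f) < LOF(f)
            Get #f, , recType            ' 2-byte record type code
            Get #f, , recLen             ' 2-byte record length
            If recType = &H6 Then nFormulas = nFormulas + 1
            Seek #f, Seek(f) + recLen    ' skip the record payload entirely
        Loop
        Close #f
        Debug.Print nFormulas & " FORMULA records found"
    End Sub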

It’s this kind of efficiency that makes the “SpecialCells” method work so well.  Using SpecialCells, VBA is able to return the count of any single cell type or combination of cell types.  For example, this VBA line will return a count of all the formulas in a sheet:

sheet.Cells.SpecialCells(xlCellTypeFormulas, xlNumbers Or xlTextValues Or xlErrors Or xlLogical).Count

(I’ve used “sheet” here but the starting point could be any 2D range.)  A user writing VBA to loop over the cells cannot hope to be as efficient when scanning a worksheet.  A fuller version with error handling is sketched below.
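One caveat, which may also be relevant to the Excel 2003 report in the comments below: SpecialCells raises run-time error 1004 when no matching cells are found, so robust code needs an error trap.  A minimal sketch (the function name is invented for illustration):

    Function CountFormulas(ws As Worksheet) As Long
        Dim rng As Range
        On Error Resume Next   ' SpecialCells raises error 1004 when nothing matches
        Set rng = ws.Cells.SpecialCells(xlCellTypeFormulas, _
            xlNumbers Or xlTextValues Or xlErrors Or xlLogical)
        On Error GoTo 0
        If rng Is Nothing Then
            CountFormulas = 0
        Else
            CountFormulas = rng.Count
        End If
    End Function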

Excel will require all workbook data to be loaded into memory sooner or later.  Windows virtual memory means that it is conceivable a user can open a file larger than the available physical memory, but only at the cost of performance and/or reliability.  Windows provides the illusion of virtual memory by paging real memory to/from disk.  Disk access is 1,000 or more times slower than memory access, which leads to the performance penalty.  If Excel needs more of the workbook data in memory than the system can provide to complete a calculation, Excel will fail.  These two points probably explain the performance and reliability issues mentioned by other posters.  Except for per-sheet limits such as the 256 columns and 65,536 rows of the 97-2003 format, there’s no inherent restriction on the volume of data Excel can handle, so these issues will probably be resolved by adding more memory.  A 64-bit CPU can address many, many gigabytes of RAM, and since a PC with a dual-core 64-bit CPU and 16GB of RAM could be bought for less than £500 over the summer, the cost to resolve this issue is not great.

Excel 97-2003 files save data in a Microsoft file format known as a structured storage file (the Excel 2007 binary format, .xlsb, instead stores binary records inside the same ZIP package used by the XML formats).  A structured storage file is like a disk within a file, as it is able to contain logical files and sub-folders.  The workbook data is contained in a stream called “Workbook”.  However, this stream does not contain all workbook data.  VBA code is held in a series of sub-folders beginning with a sub-folder called “VBA”.  Likewise, revision log information is held in another stream, and embedded images or documents are held in yet other streams within the storage file.  Pivot table data is also stored within its own structure.  If a pivot table has been created from an external source, Excel will cache this data in a logical file.  This means that Excel does not need to read all the external pivot table data when it reads a sheet containing a pivot table.  Only the results of the last pivot are read (just like any other set of cells), along with a handful of control records.

My take is that there will be no single absolute answer to your question because of the different types of information that can be included in an Excel workbook.  A review of the information Microsoft has made available will give you a good idea of the likely impact of the structure of data in a workbook.


Reader Comments

Thanks. For the most part this provides a detailed rationale for issues that I was already aware of.

The code for counting formulae in a worksheet does not appear to work in Excel 2003, but I’ll keep checking.

Volatile functions will always remain an issue. We certainly use them extensively for some types of functionality; they are particularly useful in modelling approaches.

The general rule appears to be that having lots of worksheets does not (necessarily) make a difference to the efficiency of a workbook. The problems associated with finding the particular bit that you want are far more important.

Yes, I think your summary is about right.

We provide a free add-in called Workbook Statistics (http://www.lyquidity.com/exceladdins) which performs an analysis of the workbooks contained in a defined folder hierarchy.

This utility uses the SpecialCells function to count and document the various cell types within sheets in a workbook.