The problem with large Excel files
Today I was conducting an efficiency optimization session for Risk Management team of a global bank. During the discussion, one of the problems we discussed was large Excel files. Although there are many reasons for large Excel files, one of the reasons is easier to manage and it is largely unknown. It relates to files containing one or more Pivot Tables.
Pivot Table creates a copy of the source data and saves it in the file. This increases the file size and also slows down the open / close operations. You can ask Excel NOT to save the copy of the data and save on file size. Right click inside pivot, Pivot Table options, Data tab, clear the checkbox Save source data with file. That solves the file size inflation problem.
Read more to understand when to use and when NOT to use this feature.
The feature that can bug you
When you create a Pivot Table, there is raw data either in the same workbook or it is connected to some external source.
In either case, Pivot Table automatically creates a copy of the ENTIRE DATA. This copy is called a Pivot Table Cache. Before we go into WHY it creates it, just understand the impact.
The file size increases and loading / saving time also increases. Of course the cache does not occupy exactly the same amount of the space occupied by the worksheet raw data. It does compress it. So the file size may not double but it will definitely increase.
To complicate matters further, if you go to the original raw data worksheet every time you create a new Pivot Table, it is creating a fresh, new cache for itself.
So if you have multiple such pivot tables in the same file, the size can grow dramatically.
So is this a feature or a bug?
It’s a feature!
What is the benefit?
If the raw data is right there in the same workbook, there is absolutely no practical benefit of keeping a copy of it inside the Pivot Table.
Now try this. Take some tabular raw data. Save it as a test file. Create a pivot table and create some report. Now delete the sheet containing the raw data. Save the file. Close and Reopen it.
Now try adding more fields to the pivot table. Common sense says that this should not work because the data was removed. But the pivot table cache is still alive. So everything will work well.
Only one thing will NOT work. If you right click in the pivot and choose refresh, you will get an error saying that the data was not available.
It is designed for external connections
If the data comes from external database or connection, then the scene is different. Inside the Excel file we just have the pivot table. No raw data.
In this case, if you are not connected to the database then changing the pivot table will not be possible because there is no data.
Therefore, for our convenience – to allow us to work offline – Pivot Table creates the cache.
So how to reduce the file size
By default, cache is created. It does not matter if the data is in the same workbook or it is coming from some external connection.
So if you have data in the SAME workbook – you can reduce the file size by doing one of the following.
- Delete the original data and let pivot table work on the cache
- Keep the original data and let the pivot table work by querying it
Deleting the original data is easy. Just delete it.
But how do you delete the pivot table cache?
The answer is – Pivot Table will always work on the cache.
But for your file size optimization, Excel does allow you to control whether the cache is SAVED or not saved with the file.
Right click in the Pivot Table and choose Pivot Table Options
Go to Data tab and clear the checkbox Save source data with file.
Now when you save the file pivot table will not increase the file size.
Impact of this option
This is best illustrated in this nice table. I found this in a very old Microsoft Knowledge Base article. This was written for Excel 95. But is still applicable 20 years later!
|Action||Option On||Option Off|
Finally, we can have some best practices.
If there is external data source…
Keep the setting ON. So that you can work with the pivot table even when offline.
If the source data is within the pivot table workbook…
Think. Do you append the data often or is this a one time- snapshot data.
If the source data is static and never needs to be appended (snapshot reports), then delete the original data immediately after Pivot Table is created. Pivot Table will work fine. The Cache will be smaller than the size occupied by the original data. So you will have smaller file size without any functional compromise.
If you append to the source data and refresh the pivots periodically, the situation is different. This is an incremental reporting scenario. In this case, keep the source data and remove the pivot cache.
Share the cache across multiple pivot tables
If you create multiple pivot tables, then it is better to create ONE base pivot table and copy paste it to create different reports / dashboard. This way, the cache is shared across the pivot tables.
This offers another advantage. Refreshing any one of the pivot tables will automatically refresh other pivots.
Keep Cache and remove raw data to create snapshot reports
This is a corollary to the above discussion. If you need to store snapshot reports – as of Jan / Feb / March for example, we usually create a pivot table with that month’s data and then paste it as values. This removes all the analytical functionality of pivot table.
Instead, you should remove the raw data and keep the pivot table alive. This way, the cache represents the data snapshot and you still get full functionality. By removing the raw data, we are ensuring that any accidental or inadvertent append to the data will disturb the snapshot report.
Standard Operating Procedure
Whenever you open an existing file containing pivot table(s)…
Think of the context and apply the correct approach as described above. If there are multiple different pivot tables created from the same data, you have duplicated the caches. Repair those (see below).
Whenever you are about to create a new pivot table …
Take a pause. Think of the data source, business requirement (snapshot or incremental reporting) and choose the correct setting AS SOON AS you create the pivot table.
Pivot Table Duplicate Cache Removal tool
When you open an existing file with lots of pivot tables, you may have a cache duplication problem. I found a very nice tool written by Debra Dalgleish while researching this problem.
Read this article written by her about the programmatic access to the pivot table cache.
Notice that at VBA level, we have full information as well as control over pivot table caches.
At the bottom of the article, there is a nice utility created by her which checks for duplicate pivot table caches and solves the problem.
This is a long post. But I needed to write so much to do justice to the topic. Usually, this would have been a single paragraph article… something like I have written in the Quick Summary above.
But that kind of “tips and tricks” approach does not lead to KNOWLEDGE. It leads to a superficial, knee-jerk kind of approach. It may solve the problem without realizing what the problem was! I do not like that approach.
Yes, it does take up some of your time to read a long article. But if you count the amount of time you are going to save (and the disk space + mailbox space) by virtue of all pivot tables loading and saving faster, the time required to read this article is indeed a good investment