fbpx

Did you know? Pivot Tables can increase the file size!

The problem with large Excel files

Today I was conducting an efficiency optimization session for Risk Management team of a global bank. During the discussion, one of the problems we discussed was large Excel files. Although there are many reasons for large Excel files, one of the reasons is easier to manage and it is largely unknown. It relates to files containing one or more Pivot Tables.

Quick Summary

Pivot Table creates a copy of the source data and saves it in the file. This increases the file size and also slows down the open / close operations. You can ask Excel NOT to save the copy of the data and save on file size. Right click inside pivot, Pivot Table options, Data tab, clear the checkbox Save source data with file. That solves the file size inflation problem.

pivot table increase file size

Read more to understand when to use and when NOT to use this feature.

 

 

The feature that can bug you Devil

When you create a Pivot Table, there is raw data either in the same workbook or it is connected to some external source.

In either case, Pivot Table automatically creates a copy of the ENTIRE DATA. This copy is called a Pivot Table Cache. Before we go into WHY it creates it, just understand the impact.

The file size increases and loading / saving time also increases. Of course the cache does not occupy exactly the same amount of the space occupied by the worksheet raw data. It does compress it. So the file size may not double but it will definitely increase.

To complicate matters further, if you go to the original raw data worksheet every time you create a new Pivot Table, it is creating a fresh, new cache for itself.

So if you have multiple such pivot tables in the same file, the size can grow dramatically.

So is this a feature or a bug?

It’s a feature!

What is the benefit?

If the raw data is right there in the same workbook, there is absolutely no practical benefit of keeping a copy of it inside the Pivot Table.

Now try this. Take some tabular raw data. Save it as a test file. Create a pivot table and create some report. Now delete the sheet containing the raw data. Save the file. Close and Reopen it.

Now try adding more fields to the pivot table. Common sense says that this should not work because the data was removed. But the pivot table cache is still alive. So everything will work well.

Only one thing will NOT work. If you right click in the pivot and choose refresh, you will get an error saying that the data was not available.

It is designed for external connections

If the data comes from external database or connection, then the scene is different. Inside the Excel file we just have the pivot table. No raw data.

In this case, if you are not connected to the database then changing the pivot table will not be possible because there is no data.

Therefore, for our convenience – to allow us to work offline – Pivot Table creates the cache.

So how to reduce the file size

By default, cache is created. It does not matter if the data is in the same workbook or it is coming from some external connection.

So if you have data in the SAME workbook – you can reduce the file size by doing one of the following.

  1. Delete the original data and let pivot table work on the cache
  2. Keep the original data and let the pivot table work by querying it

Deleting the original data is easy. Just delete it.

But how do you delete the pivot table cache?

The answer is – Pivot Table will always work on the cache.

But for your file size optimization, Excel does allow you to control whether the cache is SAVED or not saved with the file.

Right click in the Pivot Table and choose Pivot Table Options

image

Go to Data tab and clear the checkbox Save source data with file.

SNAGHTMLc6eaee8

Now when you save the file pivot table will not increase the file size.

Impact of this option

This is best illustrated in this nice table. I found this in a very old Microsoft Knowledge Base article. This was written for Excel 95. But is still applicable 20 years later!

Action Option On Option Off
Saving Longer Shorter
Opening Same Same
Refreshing Shorter Longer
File Size Larger Smaller

Best Practices

Finally, we can have some best practices.

If there is external data source…

Keep the setting ON. So that you can work with the pivot table even when offline.

If the source data is within the pivot table workbook…

Think. Do you append the data often or is this a one time- snapshot data.

If the source data is static and never needs to be appended (snapshot reports), then delete the original data immediately after Pivot Table is created. Pivot Table will work fine. The Cache will be smaller than the size occupied by the original data. So you will have smaller file size without any functional compromise.

If you append to the source data and refresh the pivots periodically, the situation is different. This is an incremental reporting scenario. In this case, keep the source data and remove the pivot cache.

Share the cache across multiple pivot tables

If you create multiple pivot tables, then it is better to create ONE base pivot table and copy paste it to create different reports / dashboard. This way, the cache is shared across the pivot tables.

This offers another advantage. Refreshing any one of the pivot tables will automatically refresh other pivots.

Keep Cache and remove raw data to create snapshot reports

This is a corollary to the above discussion. If you need to store snapshot reports – as of Jan / Feb / March for example, we usually create a pivot table with that month’s data and then paste it as values. This removes all the analytical functionality of pivot table.

Instead, you should remove the raw data and keep the pivot table alive. This way, the cache represents the data snapshot and you still get full functionality. By removing the raw data, we are ensuring that any accidental or inadvertent append to the data will disturb the snapshot report.

Standard Operating Procedure

Whenever you open an existing file containing pivot table(s)…

Think of the context and apply the correct approach as described above. If there are multiple different pivot tables created from the same data, you have duplicated the caches. Repair those (see below).

Whenever you are about to create a new pivot table …

Take a pause. Think of the data source, business requirement (snapshot or incremental reporting) and choose the correct setting AS SOON AS you create the pivot table.

Pivot Table Duplicate Cache Removal tool

When you open an existing file with lots of pivot tables, you may have a cache duplication problem. I found a very nice tool written by Debra Dalgleish while researching this problem.

Read this article written by her about the programmatic access to the pivot table cache.

http://www.contextures.com/xlPivot11.html

Notice that at VBA level, we have full information as well as control over pivot table caches.

At the bottom of the article, there is a nice utility created by her which checks for duplicate pivot table caches and solves the problem.

Feedback

This is a long post. But I needed to write so much to do justice to the topic. Usually, this would have been a single paragraph article… something like I have written in the Quick Summary above.

But that kind of “tips and tricks” approach does not lead to KNOWLEDGE. It leads to a superficial, knee-jerk kind of approach. It may solve the problem without realizing what the problem was! I do not like that approach.

Yes, it does take up some of your time to read a long article. But if you count the amount of time you are going to save (and the disk space + mailbox space) by virtue of all pivot tables loading and saving faster, the time required to read this article is indeed a good investment

Gift with a bow

9 Responses

  1. Seems like it is best to keep you large excel file in a separate workbook and fire up your pivot table in a new Excel workbook thru the “use an external data source” and when needed to update the PT then just load new data in the initial / separate workbook and refresh the PT in the separate Excel file.

    Thx for this post!

    1. Hi Saul
      Thanks for your comments and feedback.
      You can use External Data Source approach. Another option is to should create a blank file – Connect to the Excel file using Power Pivot data model or Power Query and create Pivot Table. Power Pivot compresses the data very well. Therefore, file size impact is minimal.

  2. Thx again Dr. Nitin for suggesting Power Pivot and Query, add-ins I have been surfing for at least 12 months from home, unfortunately my work environment can’t approve the installation of those add-ins yet. Again this post inspired me a lot, now I am in the process of refining a dashboard with the “External Data Source approach” and refreshing the my pivot tables when new data is added to original data source which is huge file beyond 150K rows. 🙂

    1. and getting data for the charting thru the power of GetPivotDATA function…..:)

    2. Thanks for the feedback Saul Espinoza. I agree that most IT folks don’t know the importance of Power Pivot. That is why they don’t include it along with Office. I will write an article on this very soon. Show it to your IT team.

      I am happy to see that you are utilizing External data and GetPivotData. There is one problem with using GetPivotData. If you want multiple cells, you have to use GetPivotData multiple times. I plan to solve the problem by writing a small but useful macro soon.

      Dr Nitin

  3. Saved me about 70 MB ! Thanks vm! I am going to work on Power BI versions since my Dashboard has soo much data in it. But this saved me tons in between.

Queries | Comments | Suggestions | Wish list