If you work in the earth sciences, you have very likely heard of the netCDF file format, and if you work in some other field of science that deals with large data, you somewhat likely have too. And if you frequently manage large netCDF output from your simulations, you are probably going crazy handling it all, while salivating at every news of a major breakthrough in hard drive technology. I want to discuss some methods here that can reduce the size of netCDF files.
An obvious way to reduce file sizes is to consciously consider which data actually needs to be written to file. It is very common for people, yours truly included, to justify saving all sorts of simulation fields frequently by thinking “well you never know what kind of analysis you might want to do in the future, so it’s best to save as much as possible.” Depending on your field and the kind of research you are doing, that justification might be true, but I am willing to bet that most often the extra data just sits on some archival system and never gets analyzed. So, just give yourself some time to think about ….
Okay, back to what I wanted to talk about. I’ll use a file I have called Data.nc as the example. The fields stored in the file are 2D fields (latitude × longitude) at various points in time. The size of the file is 346 MB. (Yes, this is small, but it is only an example.)
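Before changing a file like this, it can help to check which dimension, if any, is unlimited. The sketch below assumes the ncdump utility from the netCDF tools is on your PATH and that its header output marks such a dimension with the word UNLIMITED, as CDL normally does. The function names and the sample header are made up for illustration, and the simple pattern only handles plain alphanumeric dimension names.

```python
import re
import shutil
import subprocess

# Matches CDL dimension lines such as: "time = UNLIMITED ; // (120 currently)"
UNLIMITED_RE = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*UNLIMITED[ \t]*;", re.MULTILINE)


def unlimited_dimensions(cdl_header: str) -> list:
    """Return the names of dimensions declared UNLIMITED in a CDL header."""
    return UNLIMITED_RE.findall(cdl_header)


def read_header(path: str) -> str:
    """Run `ncdump -h` on `path` and return the header text (needs ncdump installed)."""
    if shutil.which("ncdump") is None:
        raise RuntimeError("ncdump not found on PATH")
    result = subprocess.run(
        ["ncdump", "-h", path], check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical header written by hand for illustration; it is NOT the real Data.nc.
SAMPLE_HEADER = """\
netcdf Data {
dimensions:
	time = UNLIMITED ; // (120 currently)
	lat = 180 ;
	lon = 360 ;
}
"""

if __name__ == "__main__":
    print(unlimited_dimensions(SAMPLE_HEADER))
```

On a real file you would call unlimited_dimensions(read_header("Data.nc")) instead of using the sample string.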
Now to reduce the file size. I will use the handy nccopy utility that comes with netCDF.
- The first thing to do is to remove the unlimited dimension, that is, to convert it into a fixed-size one. I run this command at the terminal: nccopy -u Data.nc Data_nou.nc. This took 22 seconds but didn’t do much for the file size: the new size was 345 MB. Even though the file size barely changed, this step is essential for the next one.
- Now that the unlimited dimension has been removed, netCDF can do a better job of compressing the data. I convert the file created in the previous step into another file, specifying a deflation level of 6 (the same as in the original file) and turning on shuffle: nccopy -s -d6 Data_nou.nc Data_final.nc. This step took 16 seconds and the new file is now only 283 MB! That’s a reduction of 18%!
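As a quick check of the arithmetic behind that 18%, here is a minimal sketch using the sizes reported in the steps above. The function name is just an illustrative choice.

```python
def percent_reduction(size_before: float, size_after: float) -> float:
    """Percentage by which size_after is smaller than size_before."""
    if size_before <= 0:
        raise ValueError("size_before must be positive")
    return 100.0 * (size_before - size_after) / size_before


# Sizes in MB, taken from the steps above.
print(f"{percent_reduction(345, 283):.1f}%")  # relative to Data_nou.nc: 18.0%
print(f"{percent_reduction(346, 283):.1f}%")  # relative to the original Data.nc: 18.2%
```

Either way you measure it, the saving rounds to 18%.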
The two steps can be combined into one: nccopy -u -s -d6 Data.nc Data_final.nc. The message is this: if a file has an unlimited dimension and you are not going to write along that dimension any more, convert it into a regular dimension and then apply compression to save quite a bit of disk space.
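If you have many files to process, the combined command can be wrapped in a small script. The following is only a sketch: it assumes nccopy is installed and on your PATH, and the function names and defaults are my own choices for illustration rather than part of netCDF. It uses only the -u, -s and -d flags shown above.

```python
import os
import shutil
import subprocess


def build_nccopy_command(src, dst, deflate_level=6, shuffle=True, fix_unlimited=True):
    """Build the nccopy argument list for the combined conversion."""
    if not 0 <= deflate_level <= 9:
        raise ValueError("deflate_level must be between 0 and 9")
    cmd = ["nccopy"]
    if fix_unlimited:
        cmd.append("-u")  # convert unlimited dimensions to fixed size
    if shuffle:
        cmd.append("-s")  # turn on the shuffle filter
    cmd.append(f"-d{deflate_level}")  # deflation (compression) level
    cmd += [src, dst]
    return cmd


def compress(src, dst, **options):
    """Run nccopy and return (bytes_before, bytes_after). Needs nccopy installed."""
    if shutil.which("nccopy") is None:
        raise RuntimeError("nccopy not found on PATH")
    subprocess.run(build_nccopy_command(src, dst, **options), check=True)
    return os.path.getsize(src), os.path.getsize(dst)


if __name__ == "__main__":
    # Only prints the command; call compress("Data.nc", "Data_final.nc") to run it.
    print(" ".join(build_nccopy_command("Data.nc", "Data_final.nc")))
```

Keeping the command construction separate from the call to nccopy makes it easy to inspect exactly what will be run before touching any data.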