Revisiting ‘How Many Documents in a Gigabyte?’

I once wrote a column titled “Page Equivalency and Other Fables.” It lambasted lawyers who larded their burden arguments with bogus page equivalencies like, “everyone knows a gigabyte of data equates to a pile of printed pages that would reach from Uranus to Earth.” We still see wacky page equivalencies, and “from Uranus” still aptly describes their provenance.

Back in 2007, I wrote, “It’s comforting to quantify electronically stored information as some number of pieces of paper or bankers’ boxes. Paper and lawyers are old friends. But you can’t reliably equate a volume of data with a number of pages unless you know the composition of the data. Even then, it’s a leap of faith.”

So, I’m happy to point you to some notable work by my friend, John Tredennick. I’ve known John since the emerging technology was fire and watched with awe and admiration as John transitioned from old-school trial lawyer to visionary forensic technology entrepreneur running e-discovery service provider, Catalyst. John is as close to a Renaissance man as anyone I know in e-discovery, and when John speaks, I listen.

Lately, John Tredennick shared some revealing metrics on the Catalyst blog looking at the relationship between data and document volumes, an update to his 2011 article, “How Many Documents in a Gigabyte?” John again examines document volumes seen in the data that Catalyst receives and processes for its customers and, crucially, parses the data by file type. As the results bear out, the forms of the data still make an enormous difference in terms of data volume. Even between documents we think of as being “the same” (like Word .doc and .docx formats), the differences are striking.

For example, John’s data suggests that there are almost 60% more documents in a gigabyte of Word files in the .docx format (7,085) than in a gigabyte of files stored in the predecessor .doc format (4,472). This makes sense because the newer .docx format incorporates zip compression, and text is highly compressible data.

[One exercise I require of the law students in my E-discovery class is to look at the file header of a Word .docx file to note its binary signature, PK, characteristic of a zip-compressed file and short for Phil Katz, creator of the zip file format. For grins, you can change the file extension of a .docx file to .zip and open it to see what a Word document really looks like under the hood. Hint: it’s in XML.]
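
For readers who would rather peek with a script than a hex viewer, here is a minimal Python sketch of the same exercise. It is an illustration, not a forensic tool: the helper names are made up for this example, and the sample bytes are a tiny stand-in built in memory to mimic the layout of a .docx, not a valid Word file. To try it on a real document, read the file's bytes and pass them to the same two functions.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """True if the bytes start with the zip local-file-header signature PK\\x03\\x04."""
    return data[:4] == b"PK\x03\x04"


def list_parts(data: bytes) -> list[str]:
    """Return the names of the parts stored inside a zip container."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def make_stand_in_docx() -> bytes:
    """Build a tiny zip that mimics a .docx layout (hypothetical, NOT a valid Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document><w:body/></w:document>")
    return buf.getvalue()


sample = make_stand_in_docx()
print(sample[:2])              # the two signature bytes
print(looks_like_zip(sample))  # signature check
print(list_parts(sample))      # the XML parts inside the container
```

With a real file, something like `data = open("report.docx", "rb").read()` (the filename is hypothetical) would replace the stand-in.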

John reports a similar discrepancy between new and old Excel spreadsheet formats (1,883 .xlsx files per gigabyte versus 1,307 for .xls). Here again, the .xlsx format builds in zip compression.
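
The arithmetic behind those comparisons is easy to check. The sketch below uses only the Word and Excel counts quoted above. The average file sizes it prints rest on an assumption that is not in John's post: that a gigabyte means 10^9 bytes.

```python
# Files per gigabyte, as quoted from John's figures.
FILES_PER_GB = {"doc": 4472, "docx": 7085, "xls": 1307, "xlsx": 1883}

BYTES_PER_GB = 10**9  # assumption: decimal gigabyte


def percent_more(newer: int, older: int) -> float:
    """How many percent more files fit per gigabyte in the newer format."""
    return (newer - older) / older * 100


def average_kb(files_per_gb: int) -> float:
    """Implied average file size in kilobytes (1 KB = 1,000 bytes here)."""
    return BYTES_PER_GB / files_per_gb / 1000


word_gain = percent_more(FILES_PER_GB["docx"], FILES_PER_GB["doc"])
excel_gain = percent_more(FILES_PER_GB["xlsx"], FILES_PER_GB["xls"])

print(f".docx vs .doc: {word_gain:.0f}% more files per gigabyte")
print(f".xlsx vs .xls: {excel_gain:.0f}% more files per gigabyte")
for ext, count in FILES_PER_GB.items():
    print(f".{ext}: about {average_kb(count):.0f} KB per file on average")
```

The gains work out to roughly 58% for Word and 44% for Excel.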

But the results are reversed when it comes to PowerPoint presentations, with John finding that there are marginally fewer of the newer .pptx files in a gigabyte (505) than the older .ppt format files (580). This makes sense to me because Microsoft phased out the .ppt format about seven years ago, with Office 2007. Since then, presenters have gotten better about adding visual enhancements to deadly-dull PowerPoints, and they tend to add ‘fatter’ components like video clips. The biggest factor is that the pictures in a presentation are largely incompressible, because common image formats (e.g., .jpg images) are already compressed. Compressing data that’s already compressed tends to increase, not decrease, its size.
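
The compression point is simple to demonstrate. In the hedged sketch below, repetitive prose stands in for the XML text inside an Office file, and pseudo-random bytes stand in for already-compressed content such as a JPEG. Both are stand-ins chosen for illustration, not real documents or images, and Python's zlib module is used because it implements the same family of compression (DEFLATE) that zip files commonly use.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means it shrank)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive prose stands in for the XML text inside an Office file.
text = ("The quick brown fox jumps over the lazy dog. " * 2000).encode("utf-8")

# Pseudo-random bytes stand in for already-compressed content such as a JPEG.
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

text_ratio = compression_ratio(text)
noise_ratio = compression_ratio(noise)

print(f"text:  compressed to {text_ratio:.3f} of original size")
print(f"noise: compressed to {noise_ratio:.3f} of original size")
```

The text shrinks to a small fraction of its size, while the noise comes out slightly larger than it went in because the compressor adds its own overhead without finding anything to squeeze.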

Wisely, John speaks only of document volumes and makes no effort to project page equivalencies, not even by extrapolating some postulated ‘average-pages-per-file type.’ Anything like that would be as insupportable today as it was when I wrote about it in 2007. Also, when you look at John’s post, note that there is no data supplied concerning TIFF images. I’m not sure why, but I can promise you this: TIFF images are MUCH fatter files, costing far more in terms of storage space and ingestion costs than their native counterparts. Had John added TIFF to the mix, I’m confident his weighted averages would have been much different…and far less useful–much like TIFF images as a form of production. 😉

6 thoughts on “Revisiting ‘How Many Documents in a Gigabyte?’”

Richard Anton said:

January 15, 2014 at 4:41 PM

I wonder if John Tredennick is from Abilene. I went to school with Nick Tredennick, and I think John might be a younger brother. If so, hello to both.

Richard Anton

- John Tredennick said:
  
  January 16, 2014 at 9:00 AM
  
  Actually no but Nick (a renowned early microprocessor chip designer) is a cousin and a great guy. He lives in the San Jose area and has several brothers who are scattered about.
  
  John Tredennick
  
John Tredennick said:

January 16, 2014 at 9:38 AM

Thanks for your nice words Craig. I am mostly pleased that I can call many of the true e-discovery pioneers friends. And, as Kevin Costner said in Bull Durham, I am happy just to be playing in the game.

Hey, there are two reasons I didn’t focus on TIFF files. First, we believe PDF is a far better format for images than TIFF. Aside from hit highlights, color is a big reason. Color TIFFs are huge in size and JPEG is a terrible substitute. Color PDFs are a fraction of the size and much better for viewing. So we don’t have many TIFFs on our site.

The second reason is that I don’t view TIFF as a native format. Our focus was on the size of native files and we rarely see TIFF as a native file. The PDF files I reported on were native as well. We did not report on the PDF copies of native files on our site.

That said, the size of TIFF files is certainly a relevant inquiry. I can recall looking at that issue in the early 2000s when all our files were from scanned documents. I believe we found an average of about 37 kilobytes a page (and yes, you could talk about pages in a meaningful way then). We would figure on roughly 25,000 pages to a gigabyte, if memory serves.

Those figures changed dramatically when we moved from typewritten pages to manuals and high resolution documents. We have seen some small or single-page files that are a gigabyte in size.

In any event, I thought these figures would be of interest to our community. I can’t say that they are representative beyond the fact that there is no reason to think that the kinds of documents we receive are much different from those other vendors receive. But it would be interesting to have others present their findings as well.

Regards and thanks again for the kind thoughts.

John Tredennick

Pingback: Daily Blog #209 Saturday Reading 1/18/14 : Learn DFIR
Brian Lamb said:

April 11, 2014 at 8:21 AM

There’s another consideration here given a disk’s formatting structure. Most disk file systems deal with a so-called “block” of data. That block is usually a fixed size of some small(ish) value like 8, 16, 32, or 64K bytes (K for kilo, which here means 1,024).

If you have a disk format that uses very large blocks but you have data that is very small, you waste a lot of space, *because* even a small file will take the whole block. E.g., a 2K text document in a file system that allocates 64K blocks will actually appear to take up the whole 64K (62K of it will be blank data).

Small block sizes, on the other hand, need a lot more overhead to track which blocks belong to which files, since a *large* file that spans multiple blocks has to be tracked to combine all of those blocks into the end result that you can view or edit or copy or whatever. There is some table or other database that tracks all of that for the various file systems (e.g., FAT for File Allocation Table, or NTFS, or ext2, etc.).

OK, so then part of this comes down to: If you have a lot of relatively small files then you want a small block size or else you will waste a lot of space on empty parts of those blocks. If you have a lot of big files (say video or audio) you would probably want a larger block size which will be more efficient both in size and speed of access.
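
The wasted-space arithmetic described above can be sketched in a few lines of Python. This is a simplified model that assumes every file occupies a whole number of fixed-size blocks. Real file systems add wrinkles this ignores (some store very small files inside their own metadata, for instance), and the function names are invented for the example.

```python
KIB = 1024  # bytes in one K, as block sizes are usually counted


def allocated_bytes(file_size: int, block_size: int) -> int:
    """Space a file occupies when rounded up to whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division without floats
    return blocks * block_size


def slack_bytes(file_size: int, block_size: int) -> int:
    """Allocated space that holds no file data."""
    return allocated_bytes(file_size, block_size) - file_size


# The 2K text document from the example above, under two block sizes.
for block_k in (4, 64):
    wasted = slack_bytes(2 * KIB, block_k * KIB)
    print(f"2K file, {block_k}K blocks: {wasted // KIB}K unused")
```

Under this model the 2K file leaves 62K unused in a 64K block but only 2K unused in a 4K block.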

Well, this delves deeper into the workings of file systems and such but thought it might be of interest to readers of the article looking for even more info about this kind of stuff.

(There are some more in-depth discussions of this here too, maybe a little bit much if you are just interested in file formats like the original article details!)
http://pclt.sites.yale.edu/blog/2010/03/10/disk-block-size

Brian Lamb said:

April 11, 2014 at 8:26 AM

Oh, (and this really may not be my field or forte) but what about PNG? I know TIFF is a classic format for accurate image data… but it clearly has been eclipsed by both lossy and lossless compressed formats, right? PDF seems OK too, AFAIK.


Ball in your Court

~ Musings on e-discovery & forensics.
