The Evolution of the ZIP File, One of the World's Most Important File Formats

We think about files too much, and it's not good for us. Most of our clients send us files and download documents in ZIP format, and ZIP files have been used to transfer data since long before the Web. We use them whenever we need to move business records, email collections, or video game saves from one place to another.

Often we don't even realize we're using them. For example, modern Microsoft Office documents and spreadsheets are ZIP files with different file extensions. ZIPs are also used in the legal field, from the simple packaging of a single draft memo to the massive hundred-gigabyte document production that contains thousands of billable hours of attorney work product.

We often use ZIPs without considering where the format came from or how it works. This shows how flexible, durable, and valuable they are.

Look at it from a practical point of view: ZIPs put a bunch of files into one big file. This is called archiving. They also (usually) make the files smaller than they would be otherwise. This is called compression. Most of the time, it's easier to move an archive between computers than many separate files. This is because a single file is easier to move than many, and because computer networks are much slower than computers, a smaller file is easier to send than a bigger one.
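To make the two ideas concrete, here is a minimal sketch using Python's standard zipfile module. The file names and contents are made up for illustration, and the in-memory buffer stands in for a .zip file on disk.

```python
import io
import zipfile

# Three made-up "files"; the names and contents are purely illustrative.
files = {
    "memo.txt": b"Draft memo. " * 200,
    "notes.txt": b"Meeting notes. " * 200,
    "log.txt": b"Nothing to report. " * 200,
}

buffer = io.BytesIO()  # stands in for a .zip file on disk
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)  # archiving: many files go into one

total_input = sum(len(data) for data in files.values())
archive_size = len(buffer.getvalue())
print(total_input, "bytes in,", archive_size, "bytes out")

# Reading the archive back returns the original bytes unchanged.
with zipfile.ZipFile(buffer) as archive:
    restored = {name: archive.read(name) for name in archive.namelist()}
```

The sample text is deliberately repetitive, so it shrinks a lot. Data that is already compressed, such as photos or video, usually shrinks very little.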

The concepts behind ZIP have been around for a long time.

In the early days of electronic computers, there were no rules or standards for separating different data entries or records from each other. Computers became essential tools after World War II when businesses and government agencies had to deal with a massive amount of new scientific, economic, political, and legal data.

How Computer "Files" Came to Be

In the past, manila folders, paperclips, and filing cabinets were used to store, search for, and send information. In 1960, a file system wasn't a piece of software that kept the photos your aunt just sent you from showing up in your latest secret reply to that annoying motion to dismiss. Instead, it was a set of rules for how people stored and looked for paper documents in a metal file cabinet or on a shelf. Most people knew at least something about how their local library kept track of books and where they were.

So, computer designers and engineers used some familiar patterns to make it easier for people to live and work in a suddenly digital world. They used the idea of a filing system to set up a way for machines to store electronic records.

ERMA, which Bank of America made in 1955 to handle the increasing number of checks their wealthy customers were writing, was one of the first successful computer file systems. Companies such as IBM and AT&T's Bell Labs, along with government agencies, did the same. By 1975, most mass-produced computers had an electronic file system for keeping track of files.

With more file systems came more ways to name, organize, and describe files, as well as a strong need to move the information in files between computers or onto tapes for storage.

Business deals worth billions of dollars and government projects all of a sudden depended on whether or not millions of files of data from one organization could be moved into the computers of the other organization. It seems silly to worry about these things now, but back in the days of the mainframe, they were huge problems with no clear answers.

Early archive formats pack the contents of files and some standard metadata, like their names, locations in the file system, and the times they were created and last used, into a single file. Anyone who knew the scheme could decode the archive's files and add them to a different file system, removing any information that didn't make sense in the new place for the files. The oldest scheme still widely used is TAR, a name that originally stood for "Tape ARchive."

The size of the single files these early archivers made was always a problem. If you make an archive of every file on a computer, that single file will be at least as big as all the files on the computer.
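As a rough illustration of that kind of scheme, the sketch below uses Python's standard tarfile module to pack one made-up file and its metadata into an archive, then read both back. The path, contents, and timestamp are invented for the example.

```python
import io
import tarfile

content = b"Quarterly figures go here.\n"  # made-up file contents

# Describe the file: its path, its size, and its last-modified time.
info = tarfile.TarInfo(name="reports/q1.txt")
info.size = len(content)
info.mtime = 86400  # seconds since 1970-01-01, chosen arbitrarily

buffer = io.BytesIO()  # stands in for a tape or a .tar file on disk
with tarfile.open(fileobj=buffer, mode="w") as archive:
    archive.addfile(info, io.BytesIO(content))

# Anyone who knows the scheme can decode the archive again.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as archive:
    member = archive.getmember("reports/q1.txt")
    recovered = archive.extractfile(member).read()

print(member.name, member.size, member.mtime)
print(len(buffer.getvalue()), "bytes of archive for", len(content), "bytes of content")
```

The archive comes out larger than the file it holds, because the metadata and padding take up space and nothing is compressed.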

How to solve the problem of data compression

Data transfer and storage have always been the most time-consuming and most expensive parts of computer operation, so making data smaller has always been very profitable.

Before the AI craze, people were crazy about data compression.

And just like AI, data compression is thought of as a mysterious art, even by software engineers. Psychoacoustic modeling and differential phase-shift keying are terms that engineers in the field use all the time. But the truth is that many core ideas are straightforward to understand. For example, it takes fewer characters to write the words "sixty quadrillion" than the number "60,000,000,000,000,000." If that number appears a thousand times in a document, you'll save a lot of ink and hand cramps by writing the two words instead of the twenty-two digits and commas.
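The arithmetic behind that example can be checked in a few lines of Python. This is only a toy substitution scheme for illustration, not how ZIP or any real compressor works, and the thousand-number "document" is made up.

```python
NUMBER = "60,000,000,000,000,000"  # 22 characters, counting the commas
WORDS = "sixty quadrillion"        # 17 characters

# A made-up document in which the number appears a thousand times.
document = " ".join([NUMBER] * 1000)

# "Compress" by swapping every copy of the number for the shorter words.
shortened = document.replace(NUMBER, WORDS)
saved = len(document) - len(shortened)
print(len(document), "characters before,", len(shortened), "after,", saved, "saved")

# "Decompress" by swapping the words back; nothing is lost.
round_trip = shortened.replace(WORDS, NUMBER)
```

Each substitution saves five characters, so a thousand of them save five thousand. Real compressors find and shorten repeated patterns automatically instead of relying on a hand-picked phrase.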

The MacGuffin of HBO's hit show Silicon Valley is data compression, and for a good reason: many millionaires in their 50s got rich because they knew about things like arithmetic coding and the Burrows-Wheeler transform. By the early 1980s, data compression engineers had become rock stars in the software industry, thanks to their strange statistical tricks and low-level programming spells.

How Phil Katz came up with the ZIP file

One of the most influential of these engineers was a man from Milwaukee named Phil Katz. If you don't know his name, you might be surprised to find out that you've been saying his initials unconsciously for more than 20 years.

In the mid-to-late 1980s, PCs became a big part of American business and education, and more and more data was made with them. In a flash, fixed disks went from an expensive luxury to an absolute must-have. Every type of software came in a box with floppy disks, and office supply stores started selling floppies directly to consumers in bulk. In 1986, a high-capacity 5-1/4" floppy disk cost around $4, which is about $10 today. Investors and computer fans would listen to anyone who could reliably cut the use of floppies.

Katz started a software compression company called PKWare the same year he had problems with an employer he thought didn't value his skills enough. He made a program that combined the functions of archiving and compression. This program made single files that both put together several files and shrank them down, so they took up much less space.

Katz knew more about computers than most people, and the compression program he made was faster and better than the program that came before it, which was called ARC. He was going to call it PKARC at first, but a trademark dispute forced him to come up with a new name that would fit within the three-letter limit for file extensions back then. The story goes that he heard someone praise how "zippy" the software was, so he chose the name ".zip."

A few years later, PKZIP became the standard way to compress and store files on PCs.

ZIP joined the likes of .DOC, .PDF, and .TXT in the hall of fame.

ZIP's success was due to two factors.

First, the PKZIP program was shared as shareware, which meant that there was a free version that people could give away with their compressed archives so that anyone could open them without having to buy a copy of the software.

Second, and perhaps most importantly, Katz included the ZIP standard with every copy of the software. This is the rule book for how to make and read ZIP files. This meant that any software engineer who knew enough about the field could write their own software to work with the ZIP format, and many did.

The ZIP standard includes a "record separator," a pattern of data that tells the computer when one compressed file ends and the next one begins. Every file in a ZIP has the separator at least twice. It doesn't matter what's in the separator, as long as it's consistent, so Katz just used his initials, PK. Sure enough, if you open a ZIP file with a program that lets you look at the file's raw contents (sometimes called a "hex editor"), you'll see "PK" everywhere.
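You don't need a hex editor to see this. The sketch below builds a tiny two-file archive in memory with Python's standard zipfile module and looks for the markers in the raw bytes. The file names are made up, and the three four-byte patterns are the marker variants defined in the published ZIP format, which the description above sums up simply as "PK."

```python
import io
import zipfile

# Build a small archive in memory; the file names are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", b"first file")
    archive.writestr("b.txt", b"second file")

raw = buffer.getvalue()

# Each stored file gets one marker in front of its data...
local_headers = raw.count(b"PK\x03\x04")
# ...and a second one in the index at the end of the archive.
index_entries = raw.count(b"PK\x01\x02")
# One final marker closes the index.
end_markers = raw.count(b"PK\x05\x06")

print(raw[:2])  # the first two bytes of the archive
print(local_headers, index_entries, end_markers)
```

Every file shows up once in front of its data and once in the index, which is why each file in a ZIP carries the initials at least twice.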
