Pointers as files

Written on 2022-03-27

Some time ago, I came up with a coarse analogy for pointers using filesystem concepts. While I do not believe that one should ultimately think of pointers as files (they obviously are not, and it is not a correct mental model; see the caveats at the bottom of this page), you might find it useful as early teaching material, to help build the first nuggets of intuition in a student's mind.

Files

What's a file? Ignoring metadata, it's essentially a lump of data, stored somewhere on some storage medium like a hard drive or an SSD.

There is no universal way of determining what structure that data has or how to interpret it. Sometimes, the reader of the file knows the format and decodes it accordingly. Sometimes, that information is stored in the file itself, as is often the case for media files.

To access a file, you need to know its path. You could technically read bits off of a hard drive directly, but they would likely be meaningless to you: how would you even know where one file ends and the next one begins? If you give a path to your filesystem, it will let you access the data that corresponds to that path.

How do you get the path of a file? Well, either you somehow know it or it is stored somewhere, like in another file. Nothing prevents the contents of a file from being a path, but they may just as well be something completely different, like a picture or source code! If you know that the contents are a path, you can go ahead and read the file it names. That file may in turn contain anything, maybe the path to another file (or even to itself!). Consider:

# /path/to/some_file
/path/to/another_file

# /path/to/another_file
/path/to/my_message

# /path/to/my_message
Hello, world!

Pointers

The analogy is that in-memory data is like files and pointers are like paths. You can access memory anywhere[1], but if you don't know where to look or how to interpret that memory, you will not get very far.

Compared to paths, pointers are more lightweight. A pointer is in fact just an address: the index at which to access memory. How wide that index is depends on the architecture (64 bits on most platforms these days). So for example, instead of asking the filesystem what data is at path /path/to/my/data, you ask the memory management unit for the data that is stored starting at memory cell number 12345 (most often expressed in hexadecimal notation, like 0x3039).

This process is called dereferencing: just like a path references a file, a pointer references a memory cell. Just like you can give any string as a path to your filesystem, you can dereference any 64-bit value, but in both cases you won't get any meaningful data out if your input is not valid. What kind of data you get out is not fixed either (it could be plain data or another path/pointer, like in our previous example). If you know that you are getting a path/pointer out, you can go ahead and dereference it as well if you want.

We have previously established that files may or may not contain paths themselves; it really is up to you to know if you can treat the data in a file as a path. On the other hand, most languages (certainly anything higher level than C) explicitly let you know whether a given variable is a pointer or not, and therefore whether it even makes sense to dereference it (that is, look it up as an index in memory). In other words, you can tell the difference between an int and a pointer to an int, but you cannot tell whether a file contains a path or plain data (extensions are just a convention).

Another way in which pointers are more convenient than paths is that their type encodes what they point to. For example, there is no way for me to tell whether /path/to/my/file contains plain text, a picture or even another path, but I know with png* my_pointer that if my_pointer == 0x1234, then cell 0x1234 in memory is the start of a PNG image. In the end, there is no difference between an int* and a png*: they are both 64-bit indices from the perspective of the computer running the code. There is a difference in source code, though, and the compiler won't let a program silently use one in place of the other.
Sometimes, one does not know what is at a specific address, just like I don't know what lies at /path/to/my/file. This is where void* comes into play: it essentially means "here's where you can find some interesting data, but I don't know what is in there".

Because a pointer is just an address, you can have a look around: to get the next byte after what is at address 0xabc123, just dereference 0xabc124. This is called pointer arithmetic: adding an integer offset to a pointer, subtracting one from it, or taking the difference between two pointers (in C, multiplying or dividing pointers directly is not allowed). This is comparatively simpler than seeking in a file, which requires function calls to the filesystem to get to the right place in the file.

Caveats

As mentioned in the introduction, this analogy breaks down quite quickly, so if you intend to use it, keep in mind that it omits that:

Directories, symlinks and other concepts exist in file-land;
Inodes and similar concepts are closer to what a pointer is, but students are rarely familiar with them (and if they are, it is quite likely that they already know what a pointer is);
Files are organised in a tree structure, memory mostly has a flat layout;
Ignoring virtual memory, pointers allow direct access to memory, whereas paths need to be fed to some sort of lookup system;
One decides what path to assign to a file when creating it, whereas pointers most often come out of memory allocators, which rarely let you pick the allocation address;
Bad accesses have different behaviors: undefined behavior/segfault/garbage vs structured file access error.

If you are a learner rather than a teacher, I recommend that you look all of this up: these are all interesting and core OS concepts!

Despite those (numerous and significant) caveats, I hope that this analogy can prove useful to other people.

[1] Memory protection and virtual memory are way out of scope here.