Data During my PhD: Data Structure

Wednesday, February 18, 2026

During my PhD, I encountered two categories of data. The first is project-related data. This data usually lives in some kind of shared folder, so that everyone on the project can explore and contribute. But not everything you do during your PhD directly relates to a project, or needs to be available to colleagues. Where do you put that data?

That’s where the second category comes in: it’s everything else.

This second category, “personal” PhD data, necessarily encompasses a wide range of types of data, including, but not limited to, personal meeting notes, random collections of notes about topics only tangentially related to your PhD, small experiments, larger experiments, paper PDFs, and so on. Anything, really. I’d even put email and calendar data in this category, if not for the fact that email and calendar programs don’t interface nicely with git.1

Early in my PhD I decided that, at least from a high-level perspective, I wanted to be disciplined regarding this second category of data. Over the course of the four+ years that followed, I refined my approach. One of the outcomes of that is the structure of my personal PhD-stuff folder. Essentially, it’s a shallow hierarchy of folders with some rules about what goes where, and how to name subfolders with some consistency.

The aim of this article is not to replace scientific data management techniques. Especially if you have large datasets, the ideas I describe in this article won’t be helpful (though they also won’t hurt). For your official final paper sources and datasets, I recommend that you stick to the data management policy prescribed by your employer. For everything else, the ideas below might be useful for you: pick whatever sounds good, and leave what you don’t like.

The folder structure

Here it is, in alphabetical order:

These are the “top-level folders”, because they’re supposed to be located at the top-level of my research data hierarchy.

There are some top-level folders that I’m not including in the list above. Some of them are just bad ideas (see the next section), and others are just boring to talk about, e.g. failed note-taking experiments.

The science/ folder is also a git repository. I’ll probably write a bit more about how I use git for my personal research data in the future, but to summarize it: I commit and push twice a day, and put anything smaller than a few megabytes in this repo. Anything larger goes in my stack folder.

A small remark about the folder name science/: the folder name used to be PhD/, and before that, masterthesis/. Whatever you do: don’t follow the pattern of naming it after whatever it is you are currently doing. Any particular job title is temporary, so just pick something that 1. matches the vibe of whatever you’ll be putting in the folder, and 2. is also generic enough that you can use it in the future for most jobs without renaming. Other good candidates are:

I went with science/ because it felt kinda short even though it really isn’t. It’s also the right level of generic, since most things I put in there are somehow related to computer science.

Alright, let’s go through the top-level folders A to Z. TL;DR: at least check out the descriptions of finalResults/ and experiments/. Of all of the top-level folders, I think these have been the nicest to have. Their necessity is also not 100% evident at first. Or have a look at the last two sections for the executive summary.

admin/

This folder is for mechanical, configuration and plumbing related files. Kind of boring, but necessary. For example, my bibliography system requires a few config files for settings. These files are in the admin/ folder. Another example is the timer.py script. It’s essentially a pomodoro timer, except it hooks into the system notification system to tell you the time’s up. The admin/ folder also contains the dmenu_pubs.sh script from an earlier post about dmenu helpers.
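To give an idea of how small such a helper can be, here is a minimal sketch of a notification-based timer. It is not the actual timer.py: it assumes a Linux desktop where the `notify-send` command (from libnotify) is installed, and the names `notify_command` and `run_timer` are made up for this example.

```python
import subprocess
import time


def notify_command(title, message):
    """Build the argv for a desktop notification via notify-send (assumed installed)."""
    return ["notify-send", title, message]


def run_timer(minutes, sleep=time.sleep, run=subprocess.run):
    """Wait for `minutes`, then show a notification that the time is up.

    `sleep` and `run` are injectable so the function can be tried out
    without actually waiting and without notify-send being present.
    """
    sleep(minutes * 60)
    command = notify_command("Pomodoro", f"{minutes} minutes are up, take a break")
    run(command, check=False)
    return command
```

Calling `run_timer(25)` would block for 25 minutes and then pop up the notification.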

archive/

This is where top-level folders go when I don’t use them anymore, either because they failed to be useful or because they’re no longer relevant. At the time of writing, it contains only one zipped folder, so it’s debatable whether I’ll keep it in the long run. Nevertheless, I still have it because it feels like a useful folder still.

bibliography/

This is where I keep my pubs bibliography repository. The primary reason for me to use pubs is that it provides a reasonably structured and scriptable interface for what is essentially a database for two filetypes: bibtex files and PDFs. See my post on scripting with dmenu for an example of this.

Pubs can also store a plaintext note file for each paper, and I’ve used that a bunch of times. At the time of writing, 35 times. But it never became a habit. I think there was always a bit of a barrier because: opening the terminal and manually typing the identifier of the paper is a bit annoying, editing prose in Vim is not ideal, and there is no way to link those notes to other parts of my data in a meaningful way.

What seems to work well at the moment is putting paper notes in Obsidian. There is no concrete link between Obsidian and pubs (though that’d be cool), but there is an implicit one: the filename of a paper note corresponds directly to an entry in my pubs database. Combined with the shortcut I set up to quickly open PDFs from the pubs database, I can easily browse bibliography notes, while also having quick access to the accompanying PDFs.

More integration and shortcuts would be nice, e.g. opening the corresponding notes file when a certain PDF is open in sioyek, or copying the corresponding bibtex file given the key2, but I’m not sure if I’d use those often so I haven’t put in the effort yet. The only thing that I’m missing is an easier way to add PDFs. Currently it involves a bunch of steps:

  1. Download the PDF
  2. Find the DOI somewhere
  3. Copy the DOI, being careful not to select random other characters surrounding it. Especially annoying when I’m using my touchpad and not a mouse.
  4. Open a terminal
  5. Type scib add -D
  6. Paste the DOI in
  7. Type -d ~/Downloads/
  8. Tab tab tab and pray that the PDF filename can be tab-autocompleted

I know there are Python libraries that can scan PDFs for DOIs. Unfortunately, every once in a while I acquire PDFs which definitely don’t have DOIs in them (some publications don’t even have DOIs, e.g. Usenix papers), so there’d need to be some kind of fallback mechanism to include a DOI easily. Or maybe I should just ignore those cases. I haven’t made up my mind yet.
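A rough sketch of what "scan for a DOI, with a fallback" could look like is shown below. It assumes the PDF text has already been extracted into a string by some PDF library (that part is left out), the regular expression only approximates the DOI syntax, and `find_doi` and `doi_or_fallback` are hypothetical names. The DOI in the usage note is made up.

```python
import re

# Approximation of the common "10.<registrant>/<suffix>" DOI shape.
# Real DOI suffixes may contain characters this pattern does not cover.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-looking string in `text`, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Heuristic: trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def doi_or_fallback(text, ask=input):
    """Use a DOI found in the text, else fall back to asking for one by hand."""
    doi = find_doi(text)
    if doi is not None:
        return doi
    answer = ask("No DOI found; enter one (or leave empty to skip): ").strip()
    return answer or None
```

For example, `find_doi("doi:10.1234/example.5678, accessed today")` picks out the identifier, while a text without any DOI falls through to the manual prompt. Papers without a DOI at all would simply be skipped by leaving the prompt empty.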

collab/

Here I keep git repositories that are relevant for my current research. I do not use submodules for this. Maybe they’re a great fit, but I always find working with them confusing. Instead, I committed the collab/ folder as an empty folder, and ignore any contents in this folder via .gitignore. This way, whenever I clone my science/ folder, I’m reminded that there are also a bunch of external git repos I should consider.
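One detail worth knowing: git does not track truly empty directories, so a placeholder file is needed to make the folder show up after a clone. A common trick is a `.gitignore` inside the folder that ignores everything except itself. The sketch below sets that up; whether the ignore rule lives inside collab/ or in a top-level `.gitignore` is an assumption here, and `make_placeholder_folder` is a made-up helper name.

```python
from pathlib import Path


def make_placeholder_folder(repo_root, name="collab"):
    """Create `name`/ with a .gitignore that ignores everything except itself."""
    folder = Path(repo_root) / name
    folder.mkdir(parents=True, exist_ok=True)
    ignore_file = folder / ".gitignore"
    # "*" ignores every entry in the folder; "!.gitignore" re-includes this file,
    # so git has exactly one tracked file and the folder survives a clone.
    ignore_file.write_text("*\n!.gitignore\n")
    return ignore_file
```

After running this once and committing `collab/.gitignore`, any repository cloned into collab/ stays invisible to the outer repo.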

To manage the git repos in the collab/ folder, I use myrepos, for two reasons:

I think I’ve only ever had at most 3 repos active in the collab/ folder, so I could’ve done without myrepos. Still, it’s nice to have the infrastructure and “paper trail” ready to go.

conferences/

This folder contains a folder for each conference I need to store data for. That means any data that relates to the conference: not just drafts and slides, but also receipts, organizational information, notes I took during presentations, you name it.

A downside is that it overlaps with the projects/ folder a bit. E.g. if you’re writing a draft for a certain conference, it’s not really clear if it should go in the corresponding project or conference folder. Luckily, this is a bit of a nitpick, and doesn’t really matter in practice: I just put the draft in whichever folder was there first, and that seems to work fine.

Another downside about this folder is that it feels a bit weird to have a folder for a conference if your submission gets rejected. Having some record of all my attempts compensates for this.

courses/

Not to be confused with teaching/ further down. Here I keep all courses I follow, in contrast with courses I help teach. Anything else vaguely course-shaped, such as summer schools or exercise-heavy workshops, also goes in this folder.

Initially this folder contained mostly courses required by the UT PhD programme, but later, during my PhD and also my postdoc, I took some courses purely out of interest and because they were genuinely useful.

I make sure to put a date in every folder name. E.g. “Academic Publishing Bootcamp (2020-08)/” or “Career College (2025-09)/”. Having the parentheses in there makes it a bit harder to navigate this part of the filesystem in the CLI, but that doesn’t happen often, and usually I can tab-complete my way through it.

Putting dates in the folder names avoids name clashes between duplicate courses, e.g. in case I take courses again later on. This is rare, but it can happen. In the case of one particular course I had to drop out after one day. Since it was mandatory I had to take it again later. I also discovered later that I like having the option of viewing the list in chronological order without having to depend on flaky things like filesystem modified timestamps. This is a recurring pattern in my science/ folder, and a practice I very much recommend.

The course folders do not contain tons of files. It’s just convenient to have a default place to put course materials, notes, slides you might want to reference later, exercises, related book PDFs, etc. Also, since my PhD programme required me to keep & upload certificates of all the courses I followed, it was very handy to have a default place to dump any PDF that might serve as a certificate later. The only downside is that presentation slides can be large files, which is problematic for the git side of things. For now I’m just accepting that I can’t clone my repo on Android devices because of large file sizes.

experiments/

This folder contains all kinds of smaller projects. Some are only tangentially related to my research, others are completely separate and just wound up in this folder because spinning up a separate git repo is not worth it. The key characteristic they all share is that I only work on them for a few days to a few weeks, or maybe a few weekends. If they stay relevant for longer, they should either be upgraded to their own git repo or the projects/ folder.

Here’s a list of typical examples from stuff I’ve put in this folder:

I also put dates in the folder names here, to keep a chronological trail independent of filesystem attributes. Early on in my PhD I braincoded a script ls.py to print the folders in chronological order in accordance with the timestamp in the folder name. It’s good to have, but I rarely feel the need to use it.
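A listing script of this kind could look roughly like the sketch below. It is not the original ls.py: it assumes the date appears somewhere in the folder name as YYYY-MM or YYYY-MM-DD, and it places undated folders first. The function names are invented for this example.

```python
import re

# Matches YYYY-MM or YYYY-MM-DD anywhere in a folder name.
DATE_IN_NAME = re.compile(r"(\d{4})-(\d{2})(?:-(\d{2}))?")


def date_key(name):
    """Return (year, month, day) from a folder name, or None if it has no date."""
    match = DATE_IN_NAME.search(name)
    if match is None:
        return None
    year, month, day = match.groups()
    # A missing day counts as the first of the month.
    return (int(year), int(month), int(day or 1))


def chronological(names):
    """Sort names by embedded date; undated names come first, alphabetically."""
    def sort_key(name):
        key = date_key(name)
        return (key is not None, key or (0, 0, 0), name)
    return sorted(names, key=sort_key)
```

Something like `for name in chronological(os.listdir("experiments")): print(name)` would then print the folders oldest first, independent of filesystem timestamps.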

Currently there are 138 experiments, which is equivalent to starting a new experiment every 13 days or so since I started my PhD. That feels about right. Have a look at this nice plot:

There are definitely some flat parts of the plot, but on the whole it looks pretty steady. There are 20 experiments I started early in my PhD when I wasn’t putting dates in the folder names yet. In the plot, I put those all in the first month, but that’s not really realistic. It might explain why the early part of the plot is not as steep as the rest of the graph. I’d like to go and put dates into those folders retroactively, but I haven’t found a reliable method to determine those dates yet. I’m sure git can tell me.3
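One way git could answer that question is by asking for the commits that added files under a folder and taking the oldest one. The sketch below does that with standard `git log` options; the helper names are hypothetical, and if a folder was moved or renamed at some point, the result is probably the date of the move rather than the date the experiment started.

```python
import subprocess


def earliest_date(log_output):
    """Pick the oldest date from `git log` output (git lists newest first)."""
    lines = [line.strip() for line in log_output.splitlines() if line.strip()]
    return lines[-1] if lines else None


def first_commit_date(folder, repo="."):
    """Ask git for the date of the first commit that added a file under `folder`."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--diff-filter=A",
         "--format=%ad", "--date=short", "--", folder],
        capture_output=True, text=True, check=True,
    )
    return earliest_date(result.stdout)
```

The returned YYYY-MM-DD string could then be pasted into the folder name by hand, after a sanity check against what you remember.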

finalResults/

This is my favourite folder of them all. Whenever I complete some significant milestone or project, I put the related files in a dedicated subfolder in finalResults/. I started doing this to keep track of papers and presentations, but now I use it for important things generally: important results related to a publication, conference stuff, proposals, standalone presentations, academic achievements, and custom course material.

The most important benefit of this folder is that whenever I want to reference or send someone a previous result, I can just go into this folder and get a PDF or sources in two clicks or so. Putting everything in here takes some maintenance, but it becomes a valuable resource in the long term.

There are two structures to this folder: the outer structure, which is essentially a naming scheme for milestone folders, and the inner structure, which governs what files are in a milestone folder and their naming.

Outer structure

Each milestone folder follows the following naming scheme:

date - occasion - title or description (result types...)

Here’s a concrete example of what that looks like:

Having all this information in the folder names makes the folder listing pleasant to browse if you’re looking for something. I didn’t start out with this naming scheme; I only started doing this somewhere in 2022, when I noticed I was having a hard time finding previous results.

The folder names can get a bit long, especially if you have a long title and a bunch of result types to add. In practice, this doesn’t matter too much. The longest line I have takes up half my 1080p screen, so there’s even room for an even longer folder name still 🙂.

The parts of the naming scheme are used as follows:

date: a standard YYYY-mm-dd date. This ensures the list also sorts chronologically when sorted alphabetically. This is not an official date or anything, just the date on which I created the milestone folder, or, if I create multiple milestone folders, the dates of the days after as well.

occasion: usually the title of the event or journal, ideally including a year if this makes sense. This is usually the case for conferences and journals, e.g. ETAPS-2022 or iFM-2024. I try to keep this one short.

title: the title of the milestone, or if that doesn’t apply, a short description. Sometimes the occasion is already descriptive enough; in that case I leave the title out.

result types: this is a comma separated list of results that are part of the milestone. I try to reuse types as much as possible, but I’m also not too hesitant to create a new one if it feels right. I have used the following types so far:

Some of these are one-off types. E.g. “working directory files”, which is only used for my masterthesis milestone folder. It includes a bunch of interesting notes and other files that I’d like to keep around. Others appear frequently, e.g. paper and presentation.
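Because the naming scheme is this regular, it can also be read back by a script. The sketch below parses a folder name of the form `date - occasion - title (result types...)` into its parts, treating the title as optional. It is only an illustration: the `Milestone` class and `parse_milestone` function are made up, and the folder names in the usage note are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# date - occasion [- title] (type, type, ...)
MILESTONE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) - (?P<rest>.+?) \((?P<types>[^()]*)\)$"
)


@dataclass
class Milestone:
    date: str
    occasion: str
    title: Optional[str]
    result_types: List[str]


def parse_milestone(name):
    """Split a milestone folder name into its parts, or return None if it does not fit."""
    match = MILESTONE.match(name)
    if match is None:
        return None
    # The title is optional, so split the middle part at most once.
    occasion, _, title = match.group("rest").partition(" - ")
    types = [t.strip() for t in match.group("types").split(",") if t.strip()]
    return Milestone(match.group("date"), occasion, title or None, types)
```

For a name like `2022-04-05 - ETAPS-2022 - An example title (paper, presentation)` this yields the date, the occasion `ETAPS-2022`, the title, and the two result types; for `2024-11-12 - iFM-2024 (paper)` the title comes back as `None`.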

Inner structure

Whenever I put something in a milestone folder, I try to approach it from an archivist’s point of view: what do I want to find in this folder in 20 years, and what would be the best form to store it in? For now, for each part of a milestone folder, I try to add the final form (e.g. a finished PDF) and its sources (e.g. .tex files), cleaned up to only contain what’s needed to reproduce the final PDF, and nothing more.

If relevant, I include other files. E.g. for papers I typically include the camera-ready version, which I can share freely, and the final published version, which I can’t.

Here’s what that looks like for one of my papers:

The naming scheme here is again fairly rigid: result type - title.extension for files, resultSources/ for folders containing sources for result types, or just result type/ if the result does not consist of one particular file. Again, this is to optimize for browsing and to make it possible to catch missing files by skimming the content list of each folder. It might not work for everyone, but I like it so far.

projects/

This is a simple one. Each subfolder of projects/ contains all files related to that project. Whenever I have an idea that I think will take a while to explore, or a folder in experiments/ that I’ve been working on for a long time, I make a folder in projects/.

If you’d graph out the number of files per project folder, I think you’d get a pretty long tail, in the sense that there are a few projects with most files, and most projects having only a few. Actually, let’s do that:

(I kicked project 32 out of the graph because it had a few git repos in there that artificially inflated the numbers by a few orders of magnitude.)

This reflects kind of what I expected, but then again it’s also kind of different. The red bars are projects that actually led to a concrete output (student report, paper, etc.). The green bars are either ongoing, or, let’s say, no longer promising. In particular, projects 19, 28, 29 and 30 became chapters in my thesis. I expected the finished projects to be more heavily biased to the right than they actually are.
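The numbers behind a graph like this are easy to collect. The sketch below counts files per direct subfolder of a directory and skips `.git` directories so that git internals do not inflate the counts. That is a slightly different choice from leaving a whole project out, since files checked out from nested repos are still counted. `files_per_project` is a made-up name.

```python
import os


def files_per_project(projects_dir):
    """Count files below each direct subfolder, skipping .git directories."""
    counts = {}
    for entry in sorted(os.scandir(projects_dir), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        total = 0
        for _root, dirs, files in os.walk(entry.path):
            # Pruning in place stops os.walk from descending into .git.
            dirs[:] = [d for d in dirs if d != ".git"]
            total += len(files)
        counts[entry.name] = total
    return counts
```

Sorting the resulting dictionary by value, e.g. `sorted(files_per_project("projects").items(), key=lambda kv: kv[1])`, gives the order needed for a long-tail bar chart.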

Funnily enough, even though I have some nice schemes for putting dates in filenames almost everywhere, in the projects/ folder I don’t do that at all. I’m honestly not sure why not!

The projects/ folder is also where I used to keep files like my daily logbook, my project file with all my high-level notes and tasks per project, and some project-related archive files. A few months ago I moved those to a separate repo, purely for technical reasons: so I can also have a copy on my phone and tablet via git.

reviews/

Here I keep reviews of journal and conference papers, both written by and for me. Its mere presence provides a nice trail of all the reviewing I’ve done and received so far, which is nice to have for when you need to write such things down under the “community service” bullet in your CV. It’s also nice to have a repository of reviews lying around for when you need inspiration for how to start.

The review folders follow the following naming scheme: conference abbreviation with year - review types (date)/. Concretely that looks like this:

So far I’ve only reviewed papers and artifacts, so the review type labels are not so helpful. I still think they look nice.

One thing I didn’t do, but which I wish I had done, was keep better track of which review points I addressed in papers I co-authored, and how much I addressed them.

What I would usually do was just paste all reviews in a text file and start working through them top to bottom. Whenever I would be satisfied with the changes for one review point, I’d put “(ok)” in front of it or something similar, and move on to the next. This was a simple and effective way of tracking my progress, and made it possible to pick up where I left off the next day. When I finished addressing reviews, I’d just delete this progress file.

While simple and effective, the downside is that you don’t keep track of what possible holes (and more importantly, their sizes) are still present in your work when you publish a paper. When the day of my defense came, I knew there were still small problems in my papers, but I had a hard time tracking them down. If I had just saved the tracking list, possibly with a 1-line explanation per review point about how I addressed it, that would’ve made re-reading the papers a lot quicker and more effective.

In truth, I have to admit I’m not sure if I could’ve predicted the questions they asked if I did have access to these tracking lists. Nevertheless, reviewer feedback is valuable information, so I suggest you keep track of it, if only with a few words per item on how you tackled it, or whether you tackled it at all.
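As an illustration of what such a saved tracking list could contain, here is a minimal sketch with one record per review point, a status, and a one-line explanation. The field names, the status values and the example review points are all made up; a plain text file with the same information would work just as well.

```python
from dataclasses import dataclass


@dataclass
class ReviewPoint:
    reviewer: str
    text: str
    status: str = "open"      # "open", "ok", or "partial"
    how_addressed: str = ""   # one-line explanation, filled in when handled


def open_issues(points):
    """Return the points that are not fully addressed: the known holes."""
    return [p for p in points if p.status != "ok"]


def to_text(points):
    """Render the list in the "(ok) ..." style, one review point per line."""
    lines = []
    for p in points:
        line = f"({p.status}) [{p.reviewer}] {p.text}"
        if p.how_addressed:
            line += f" -> {p.how_addressed}"
        lines.append(line)
    return "\n".join(lines)
```

Writing the output of `to_text` next to the paper sources keeps the remaining holes, as listed by `open_issues`, findable years later.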

students/

Every student I supervise gets a folder here. It’s not really a place I do actual work, except on the rare occasion that a student gets stuck and I actually need to do some debugging. In general it’s mostly a dumping ground for files related to the student: meeting notes, emails I want to save, significant outputs (code, patches, reports, etc.), pictures or receipts.

The naming scheme here is to just use the name of the student in question as the folder name. I have not yet had a name collision, nor did I supervise one student twice! But those are just name collisions waiting to happen. I should probably start putting dates in these folder names as well.

I will probably rename this to supervision in the near future. “Student” is strongly tied to an academic context, and there’s a good chance the context or people I supervise will change in the long term.

teaching/

Here there’s a folder for each course where I contribute to teaching. Sometimes there are slides in there, communication with students I want to hang on to, or other kinds of notes. Anything related to a course that might be useful later.

One particular recurring file is Points for next year.md. Whenever I encounter something that seems useful to improve, but now’s not the time (as is frequently the case when you’re teaching), I put it in this file. It contains ideas from the entire spectrum from small to large: from small notes on how to do installation of optional tools, to ideas for how parts of the course should be restructured.

There is a certain threshold for these ideas, though: they should be completable in the few months leading up to the next installment of the course. If they’re smaller than that, I try to apply them anyway, possibly changing lectures or other material I’ve already handed out to students. Next year’s material will be copied from this year anyway, so that way the changes find their way into the course organically. If the change needs more than a few months of prep, or merely needs a longer timeline, I try to put it on my personal task list, or, even better, somewhere in my calendar as an appointment. That way I reduce the chance a bit that the Points for next year.md file becomes just another dumping ground for large projects I won’t feel like doing later on.

The subfolders within teaching/ are dated. E.g.:

(Yes, PP and PPPP are different courses.)

Beyond putting dates in the names, there’s not much of a naming scheme. Ideally, the naming is consistent over the years, so sorting alphabetically also groups courses. But even that is optional when course names and content change.

Failed folders

Of course, I didn’t just come up with these folders when I made the first commit to my PhD git repo on June 2nd, 2020. I had some expectations about what would work well (e.g. experiments/), but over the years the structure grew mostly organically. This includes some folders that I thought would be nice, but which ultimately turned out to be less useful, or which turned out to be located in the wrong place, and had to be moved.

One pattern I’ve found that fairly accurately predicts if a folder will turn out to be useful or not, is the following: topics bound by time, or not exactly related to a research project, probably shouldn’t be a top-level folder. This makes sense: if they’re not related to research, their purpose will not come up often in my day to day work. If they expire at some point, from then on they will be cluttering the root directory. In either case the folder is better off being moved somewhere else.

Of course, there are exceptions to this rule. For example, for major milestones, it’s nice having them as a top-level folder. E.g. having “defense” be a top-level folder was not only a smart move in terms of being able to easily navigate to it, it also felt motivating. In addition, obviously if you feel that something should be a top-level folder, it’s okay to put it there. That’s how I arrived at most of the structure of my research folder.

Finally, moving a folder around is usually okay if you don’t have tools or processes that depend on the exact path. E.g. the “worst” that happened to me is that I’ve had a few recent file shortcuts break because I’d been moving folders around. Usually, such problems can be solved through reconfiguration, or just re-opening the file in my case.

Here are some of the folders that failed and got, or will be, removed.

🗑️ meetings/

This folder is still a top-level folder, but I don’t use it anymore. I’ve found that, ideally, every meeting should be related to some project, or in other words a short, medium or long-term goal. I’m not saying you shouldn’t have meetings that are not directly related to your day-to-day, but if you keep notes on such meetings and put them in an isolated folder, it’s unlikely you’ll even remember to look for them later.

In the rare case I do have meetings like that, now I just put them as a bullet in my daily logbook. I got this strategy from Jeff Huang’s productivity text file, the only difference being that he puts all his meeting notes in this text file, whereas I only do this with notes from uncategorizable meetings.

Putting notes in my daily log works better than a folder dedicated to meeting notes: if something related comes up in the future, I’m more likely to remember in which period the meeting took place, or at least what I was working on at the time. That’ll help me find the notes in my daily log. As a small bonus, the notes being in my daily log increases the chance I’ll stumble upon them when randomly browsing my log file. Neither of these benefits is there if you put notes into inert dated subfolders of meetings/.

Another example is PhD (now postdoc) progress meeting notes. During my PhD progress meetings, we would usually discuss between 1 and 3 of my ongoing research projects, plus students we’d be supervising and other things that happened to be relevant at the time. This made it difficult for me to decide where to put these notes, so it made sense that, at the time, I decided to create a meetings/ folder.

Instead, what I do now is to put such generic work notes into a PhD project (now postdoc) folder in the projects/ folder. This works well in practice. I’ve now had several occasions where I wanted to remember something related to a research project or student. The moment I realize it’s something I talked about with one of my supervisors, this folder turns out to be the right place to start looking.

🗑️ jobSearch/

When I was looking for jobs around the end of my contract I needed a place to store a bunch of files and notes around that process, so I made this folder. Later I realized this fits better with my other personal data in my stack folder, where I also store my tax-related files, pictures, etc., as it’s actually a somewhat personal topic and not so technical if you think about it.

🗑️ sites/

At some point I had the idea of designing and deploying my personal website from my PhD folder/git repository. I was also expecting some people around me to need a small website soon, so I figured, let’s put it into this top-level folder. That need never materialized, and my personal site felt like a serious project that should get its own git repository.

Deploying a research site from your personal research folder is still a good idea, but I think if the need ever comes up again I’d just make it a subfolder of projects/.

These folders match one way of looking at a PhD

I think all of these folders represent important and distinct aspects of daily PhD work. To put it another way: if you’d ask me what kind of activities are generally involved with doing a PhD, I’d give a few answers.

First and foremost, your job is to formulate, then answer, research questions. That is, working out the details and implications of a particular line of research. You will need to do experiments/, most of which will fail, but some of which will grow into longer running projects/. The foundation of these projects is the papers you will collect in your bibliography/, as well as the collab/orations you’ll undertake as you’ll inevitably run into the limitations of not just your field, but yourself as a person.

You’ll probably have to do some teaching/ and grade a bunch of exams when you’d rather be working on a deadline. If you’re lucky, the amount of teaching you’ll have to do will be bounded by 20% or so. If you’re luckier, most of the teaching you’ll be doing will be in the form of supervising students/. You won’t always be the teacher, though, as most universities have graduate schools where you’ll follow some courses/. Some mandatory, and some because they look genuinely interesting.

As you progress through your PhD, most likely you’ll produce some finalResults/ that you can be proud of: papers, presentations, and other creative expressions of the knowledge you’re accumulating about your niche. You’ll also handle the day to day admin/ work of a PhD: replying to emails, cleaning up your inbox, and clicking the occasional button in the university’s HR webapp. Inevitably, as research directions fail to pan out and the years slip by, some of these endeavours you’ll have to archive/. You realize each archived folder brings you closer to the truth we so frantically pursue as scholars (and also, publications).

Finally, you will experience becoming and being part of a community. You will go to conference/s to discover the work of others, and even better, to spread the word of the cool things you’ve discovered. If you’re lucky, you’ll have some nice colleagues with you to show you around and introduce you. However, inevitably, as you go deeper into your niche, you will go to a conference entirely on your own. Imagine that! Going on your own to an event specifically organized for people with your particular interest. How will you ever manage to strike up a conversation out of the blue??? If you’re in computer science, you will most likely find this simultaneously exhilarating and frightening.

As part of the community effort, your supervisor will give you the opportunity to review/ papers every once in a while. This is a good chance to see how other people work, and more importantly, to see the spectrum of quality of work that people from other groups produce. I guarantee you you will be surprised, likely multiple times.

Takeaways

Whew! There you have it, all the folders that structure all files in my day-to-day. It might be a bit much to take in. If there’s anything I think you should take away from this post, it’s not that I think you should exactly copy my system, but this:

Flexibility and personalization are key.

Clearly my needs were not as I initially understood them to be, which required some folders to be renamed, moved, or even deleted. On top of that, my needs changed over time, causing more changes as folders became irrelevant. What’s important is that you learn to recognize the need for these changes, and act upon them when it’s the most convenient and effective to do so. This more or less boils down to starting early with some kind of structure, and not being afraid to make changes when it’s not working. Even when you’re on a deadline. Just keep making small changes, and you’ll end up with a nicer structure in the long run.

One last thing I also want to mention is that I don’t think you need an explicit structure like this. I’ve seen plenty of researchers more productive than me who just wing it, so clearly the structure is optional. Some might even say a chaotic approach to managing research data stimulates insight. I’m not sure if I’d go that far.

I hope there are some ideas in this post which you can use to improve your own PhD personal data structure!




Generated with BYOB. License: CC-BY-SA. This page is designed to last.
