The Blog of Bob Rubbens

Data during my PhD: the git repository

Wednesday, July 23, 2025

I stored almost all of my research data in a single git repo. This encompasses paper drafts, intermediate results, small experiments, etc. The repo does not include collaborations with other researchers; those usually happened in separate shared repos. As an informal retrospective on how I structured my PhD research, and for fun, I decided to do some data-sciencing on the git repo’s commit messages. I make no pretense at scientific rigour here: this is all just for fun :-).

For example, below are the 24 most positive commit messages.

  1. interesting interesting (100)
  2. nice improvement! (100)
  3. nice progress! (100)
  4. nice progress (100)
  5. pretty pretty (100)
  6. happy happy (100)
  7. fun fun fun (100)
  8. progress! (100)
  9. nice nice (100)
  10. beautiful (100)
  11. progress (100)
  12. fun fun (100)
  13. pretty! (100)
  14. pretty (100)
  15. cool! (100)
  16. nice! (100)
  17. yay! (100)
  18. nice (100)
  19. woo! (100)
  20. cool (100)
  21. wow (100)
  22. yay (100)
  23. ok! (100)
  24. ok (100)

I acquired these by exporting the commit messages from git, case folding them, and then deduplicating. For sentiment analysis, I used the well-known NLTK Python library. As you can see, there are many commit messages with a 100% score on positivity. To keep the ranking interesting, I break ties by ranking longer commit messages higher.
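The pipeline described above (case folding, deduplication, scoring, and the length tie-break) could be sketched roughly as follows. This is a minimal sketch, not the actual analysis code: `rank_messages` and `toy_positivity` are hypothetical names, and `toy_positivity` is only a stand-in scorer so the example runs without extra dependencies. With NLTK installed, a scorer based on its VADER analyzer could be plugged in instead, as noted in the comments.

```python
import re
from typing import Callable, Iterable, List, Tuple


def toy_positivity(text: str) -> float:
    """Hypothetical stand-in scorer: share of words found in a tiny positive word list.

    Assumption: with NLTK available, something like
    SentimentIntensityAnalyzer().polarity_scores(text)["pos"]
    (from nltk.sentiment.vader) could be used here instead.
    """
    positive = {"nice", "progress", "yay", "cool", "fun"}
    words = re.findall(r"[a-z]+", text)
    if not words:
        return 0.0
    return sum(w in positive for w in words) / len(words)


def rank_messages(
    messages: Iterable[str],
    score: Callable[[str], float],
    top: int = 10,
) -> List[Tuple[str, int]]:
    """Case-fold, deduplicate, score, and rank commit messages.

    Ties on the rounded percentage are broken by ranking longer messages higher.
    """
    # dict.fromkeys deduplicates while keeping first-seen order.
    unique = list(dict.fromkeys(m.strip().casefold() for m in messages if m.strip()))
    scored = [(m, round(100 * score(m))) for m in unique]
    scored.sort(key=lambda pair: (-pair[1], -len(pair[0])))
    return scored[:top]


if __name__ == "__main__":
    example = ["Nice progress!", "nice progress!", "nice", "fix typo in lemma"]
    for rank, (message, pct) in enumerate(rank_messages(example, toy_positivity), 1):
        print(f"{rank}. {message} ({pct})")
```

Passing the scorer as a parameter keeps the ranking logic independent of the sentiment library, so the same function can be reused for the negative ranking by swapping in a different scorer.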

These are the 10 most negative commit messages:

  1. tricky tricky (100)
  2. interrupt (100)
  3. scary (100)
  4. funky (100)
  5. ugh (100)
  6. scary scary work (86)
  7. ugh tricky lemma (82)
  8. bad work today! :( (77)
  9. stupid import (77)
  10. scary workday (76)

Here I just took the top 10 because there were not many ties for 100%. I’m surprised “funky” is listed in there. The rest looks accurate.

Here are the most emotional commit messages, meaning those with the most emotion going on in all three categories analysed by NLTK: positivity, negativity, and neutrality.

  1. in exceptions paper, remove all leftover old commented tex code. in the rest, add meeting questions for marieke, todos, and logbook entries (100)
  2. moved some stuff from vercors repo into my phd folder. also made a next project file, and did some work on triggers for petra (100)
  3. made bigrat all fancy, for some reason. finished with my smt typing experiment. moved smt names into scopes a bit. (100)
  4. added some while-proofs that use both big-step and small-step semantics. need to move those to a separate file (100)
  5. start drafting fields and such a bit. need to start emitting smt for the heapt type and heaps aggregated type (100)
  6. seems to still work. now to repair the whole date feature requires some work with pandoc metadata… (100)
  7. generation is done. next up, either integration in mill, or test manually directly in adder first (100)
  8. start working on removing the category row from the paper, and change the evaluation cases names (100)
  9. start a bit on the new structure of the zettel, and write one day of the uppsala vacation (100)
  10. some final changes to the exceptions paper that i submitted yesterday. also pubs config (100)
  11. intermediate result, hanoi is solvable, done! counting comes next. looks difficult…! (100)
  12. made the array interface first-class. now only some refactorings of common tasks left (100)
  13. remove slides i will probably never read. add readme to indicate i have these videos. (100)
  14. what next: generic type inference, contracts, a parser, or symbolic execution? (100)
  15. refactored the equivalence into a file, and proven seq_abort and seq_unfold! (100)
  16. partially do casing of sections/subsections/paragraphs. now for chaptertocs. (100)
  17. should look at carbon at some point (and maybe silicon) how they encode maps (100)
  18. start splitting out the solver stuff for a bit more modularization and reuse (100)
  19. add submitted artefact, paper draft around the time artefact was submitted. (100)
  20. restructuring and extending finalresults, work, prepare annual evaluation (100)
  21. finalize logbook for today and remove write paper indentation in projects (100)
  22. move stuff around. do some writing. start on comission composition list. (100)
  23. outline a bit. i think i can work out the one for industry section now (100)
  24. include pdf for jan, and pdf which includes some axioms for summations (100)

As you can see, this category mostly boils down to just a length competition. I was hoping for commit messages that had high ratings in all three of NLTK’s sentiment categories. Unfortunately that’s not the case: the commit messages above all have 100% in only the neutrality category. I leave determining why that is for future work.

Using the dates from the git commit export, I made the following bar chart to illustrate how my commit times were distributed over the day. In this chart, commits are bucketed by hour of the day:

Nothing unexpected: the biggest peak is around the end of my workday (4pm to 5pm), followed by a dip during dinner time (6pm to 8pm). The second significant peak is around lunchtime, and there is a subtle third peak in the evening (9pm to 10pm). I don’t usually work in the evenings, though it has happened occasionally around paper and thesis deadlines because I suck at following my planning. Instead, I suspect most of my evening commits are “hobby commits”. I have a habit of storing my hobby projects in my PhD folder as well; usually those are tangentially related to my research, anyway.
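For the curious, the hourly bucketing behind such a chart could look something like the sketch below. It assumes the commit dates were exported as ISO 8601 strings, one per line (for instance via `git log --format=%aI`), and it uses the hour as recorded in each commit's own timezone offset. The function name `commits_per_hour` is made up for illustration; the real analysis code may well differ.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List


def commits_per_hour(iso_dates: Iterable[str]) -> List[int]:
    """Count commits per hour of the day (0..23) from ISO 8601 date strings.

    Assumption: each string looks like '2025-07-23T16:05:12+02:00'.
    The hour is taken as written, i.e. in the commit's recorded local time.
    """
    counts = Counter(
        datetime.fromisoformat(d.strip()).hour for d in iso_dates if d.strip()
    )
    # Counter returns 0 for hours without any commits.
    return [counts[hour] for hour in range(24)]


if __name__ == "__main__":
    example_dates = [
        "2025-07-23T16:05:12+02:00",
        "2025-07-23T16:59:00+02:00",
        "2025-07-22T12:30:00+01:00",
        "2025-07-21T21:10:00+02:00",
    ]
    # Print a crude text bar chart, skipping empty hours.
    for hour, count in enumerate(commits_per_hour(example_dates)):
        if count:
            print(f"{hour:02d}h {'#' * count}")
```

The resulting list of 24 counts can be handed directly to any plotting library to draw the bar chart.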

That’s it for now. The code for this analysis is available on sourcehut.