Process Recorder - psrec

I’ve been progressively trying to learn the Rust programming language for around a year and a half now and as well as porting some of my existing C/C++ apps I have to Rust as learning exercises (but with the C/C++ versions still mostly being the main ones for the moment) I have also started to write some new from-scratch apps in Rust when I think that makes sense, rather than defaulting to C/C++ as I previously would have.

I’d recently had the need to record some basic app process stats (CPU usage and RSS memory usage) over the duration of its running, and while there are existing applications out there - i.e. psrecord - that would largely do what I wanted, I was tempted to try and write my own version in a “native” language - at least for the recording part: the plotting / visualisation part is a bit more tricky. This was partly so as to have another small project with which to gain more experience with Rust, but also because I wanted additional features like the ability to control whether to have “normalised” or “absolute” CPU usage, and the ability to separate CPU time into user and system time - so I’ve written an initial version of psrec which is my own equivalent to psrecord written in Rust.

This initial version is pretty basic so far, and doesn’t yet support all the additional features I wanted: I’m making use of the psutil crate to extract the process information for the moment, but its functionality is incomplete as it doesn’t support extracting info on a process’ children, or info like the number of active threads a process has, which is functionality I will likely want at some point, so I’m no doubt going to have to get down in the weeds with the /proc/<pid>/ file system interface, which can sometimes seem a bit primitive and messy in my experience (it would be nice to have a first-class API to access the data efficiently, rather than doing string parsing, although it does make things very visual and easy to debug).

For the moment, I’m using Python and matplotlib to plot the resulting data, which to some extent is a bit at odds with writing the main app in a compiled language like Rust, but I think it’s okay for the moment, and it allows the recording app to theoretically be more efficient and low-overhead (although polling the /proc/ file system and recording samples every second isn’t really that much overhead), whilst using Python for things it’s very good at.

Below is a basic example of the chart plot of a quick process recording:

Process recording example



Fast Log Timestamp parsing for Timestamp delta checking

A year ago I started writing a new basic app (Sniffle) to find log files based off directory/filepath patterns, allowing for recursive directory pattern wildcard matching, and then performing file content operations on the found files, mainly involving grepping/counting items, with an emphasis on finding files on NFS networks. Existing applications like grep, ack and ag (amongst others) do provide existing functionality for some of the use cases, but their defaults are generally wrong for my use-cases (i.e. I want symlinks to always be followed, and to do recursive searching by default), and some of the methods they use to be fast are not always directly compatible (at least efficiently) with NFS networks (i.e. mmap-ing files for better content searching performance).

The number of logs I’d want to search through was often in the tens of thousands, and some of the logs are (unfortunately, and often needlessly) very verbose and can be > 5 MB in some cases. Using a log database or some other similar infrastructure would generally be the more principled way of solving this issue at scale, but I’m generally the only person who occasionally needs to perform these searches, so getting dedicated infrastructure (i.e. a database) for this use-case from IT was very unlikely, essentially meaning I ended up writing my own solution at home in my own time for use at work.

I’ve had working support for finding files, grepping through those files, counting string matches in files and finding multiple cascading strings (optionally in particular orders - i.e. if the first one isn’t found, there’s no point looking for any of the others) for a while now, as well as the ability to filter the list of found files to grep/process first by modified date (e.g. log files last modified within the past 5 days) or file size threshold, but a new use-case came up at work recently where I wanted to look for time period gaps in the log timestamps, indicating that the process writing the logs (a renderer) either hadn’t been doing anything for a while, or had taken longer than expected to do some task.

Checking the time delta between timestamps on consecutive lines in a file is pretty trivial to implement at a conceptual level, however doing so in a way which is performant is a bit more work: naively using strptime() to parse the time has a significant and noticeable overhead, due to internal use of mktime() which is very expensive, and even using sscanf() to pull out the constituent parts and rebuild them, or using a series of atoi() calls - while noticeably faster than strptime() - can be improved upon if you’re just extracting digit components in known positions (although supporting multiple date-formats complicates this slightly).

Given a fixed and valid timestamp string with consistent width zero-padding - which I could validate and guarantee in my use-case - I ended up settling on subtracting each string character value per timestamp unit digit by the char value '0', to give the 0-based 0-9 integer value, and then multiplying it by its digit position in the component number and adding these values together grouped by component number, effectively extracting and accumulating final values very quickly.

It’s somewhat hacky and verbose, but demonstrably faster than the more normal approaches mentioned above, which for my purposes made it a worthwhile optimisation.

As an example, given a std::string currentString; representing a non-empty log line which is guaranteed to have at least the minimum number of string characters for an ISO 8601 date/time stamp, and size_t timestampStart representing the character position offset within the string where the timestamp starts (in order to support varying formatting around the timestamp), with the start of a log line looking something like this:

[2019-03-22 09:42:13] Did something useful

then code to parse the year from the string looks like this:

uint64_t yearVal = (currentString[timestampStart] - '0') * 1000;
yearVal += (currentString[timestampStart + 1] - '0') * 100;
yearVal += (currentString[timestampStart + 2] - '0') * 10;
yearVal += (currentString[timestampStart + 3] - '0');

Extracting the month and day like this:

uint64_t monthVal = (currentString[timestampStart + 5] - '0') * 10;
monthVal += (currentString[timestampStart + 6] - '0');

uint64_t dayVal = (currentString[timestampStart + 8] - '0') * 10;
dayVal += (currentString[timestampStart + 9] - '0');

and handling the time components is done similarly with appropriate position indices.

Using these component integer values to represent a full time value is now context-dependent on what’s trying to be achieved: if you just wanted to sort the timestamps, you could just accumulate the numbers, multiplying each component by a component-respective multiplier to build the number of seconds, i.e.:

uint64_t currentTime = (yearVal * 365 * 31 * 24 * 60 * 60) + (monthVal * 31 * 24 * 60 * 60);
		currentTime += (dayVal * 24 * 60 * 60) + (hourVal * 60 * 60) + (minuteVal * 60) + secondVal;

In other words, using a constant of 31 for the number of days in all months: because we’d only care about relative positions based on numbers for sorting, and not absolute deltas, it wouldn’t be necessary to use the correct number of days in the month.

However, for the use-case of working out time durations between each log timestamp, absolute delta values are required, and this does involve knowing the number of days within each month - so that you accurately know the difference between 23:55 on the last day of a month is 10 minutes before 00:05 on the first day of the month, which makes things a bit more complicated. I ended up using two pre-calculated static const arrays of the months, one for non-leap year years, the other for leap year years, i.e.:

// pre-calculated totals for numbers of days from start of year for each month
static const unsigned int kCumulativeDaysInYearForMonth[12] =			{ 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 };
static const unsigned int kCumulativeDaysInYearForMonthLeapYear[12] =	{ 0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335 };

and then working out whether the year was a leap year by cheating a bit and only checking the first timestamp on the first line of the logfile to see if it was a leap year, and then caching that in a variable for the rest of the log processing:

// exactly divisible by 400, not exactly divisible by 100
const bool isLeapYear = ((currentYear % 400 == 0) ||
                        (currentYear % 100 != 0)) && (currentYear % 4 == 0);
const unsigned int* pCumulativeDaysInMonth = (isLeapYear) ?
                        kCumulativeDaysInYearForMonthLeapYear : kCumulativeDaysInYearForMonth;

which then meant not needing to do any branching for any of the remainder log line timestamps to work this out, and being able to build up the number of seconds a timestamp represented with:

const unsigned int numDaysSinceStartOfYearToMonth = pCumulativeDaysInMonth[monthVal - 1];

uint64_t currentTime = (yearVal * 365 * 31 * 24 * 60 * 60) + (numDaysSinceStartOfYearToMonth * 24 * 60 * 60);
currentTime += (dayVal * 24 * 60 * 60) + (hourVal * 60 * 60) + (minuteVal * 60) + secondVal;

which then works to provide exact absolute counts of the number of seconds the timestamp represents for nearly all situations, except for two:

  • The first line of the log file having a timestamp almost two months before the end of February, e.g. December 30th, and some following log line timestamps in the same log file then progressing on to the end of February, potentially meaning there was a discrepancy between the code thinking the year was a leap year or not
  • Daylight Savings Time changes

Both of these situations in the end I decided to ignore: the first one because it was a total non-issue for my use-case, as if the length of the log files timestamps stretched to almost two months, straddling a new year and then on to the end of February, there would be much bigger issues to worry about: almost all log files should have had a duration of under 48 hours, and anything over two weeks long would be a pathological situation. For the Daylight Savings Time change, while it was a definite potential issue that could happen - either adding an extra hour or removing it from the time value in the hour after the change - which could then have tripped or not tripped the time delta threshold logic incorrectly, I was happy to let that problem slide: dealing with DST changes in computer systems is almost always problematic in my experience (especially if using local timestamps like here), and while it was technically solvable if you know the dates it changes (per geographic location: different countries change at differing times during the year, and the Earth hemisphere matters as well for direction), I just didn’t feel it was worth it for the potential of some false positives/negatives happening within two to three hours per year.




Archive
Full Index

2024 (6)
2023 (7)
2022 (3)
2021 (5)
2020 (4)
2019 (7)
2017 (1)
2016 (2)
2015 (1)
2014 (9)
2013 (10)
2012 (7)


Tags List