Erik Naggum
Naggum
Software
Oslo, Norway
1999-10-11
The programming language Common Lisp offers a few functions to support the concept of time as humans experience it, including GET-UNIVERSAL-TIME, ENCODE-UNIVERSAL-TIME, DECODE-UNIVERSAL-TIME, and GET-DECODED-TIME. These functions assume the existence of a timezone and a daylight saving time regime, such that they can support the usual expression of time in the environment in which a small number of real-life applications run. The majority of applications, however, need more support to be able to read and write dates and times, calculate with time, schedule events at specific clock times daily, and work with several time zones and daylight saving time regimes. This paper discusses some of the problems inherent in processing time suitable to humans and describes a solution employed by the author in a number of applications, the LOCAL-TIME concept.
For instance, everyone knows which century they are in or which century a two-digit year refers to. Until computers came along, the assumptions held by people were either recoverable from the context or shared by contemporary communicators. After computers came to store information for us, we still held onto the context as if the computers were as able to recover it as we are. Quite obviously, they aren't, and in about three months, we will see whether other humans were indeed able to recover the context left unstated by other humans when they wrote down their dates with two digits and assumed it would never be a problem. The infamous Y2K problem is one of the few opportunities mankind will get to tally the costs of lack of precision in our common forms of communication. The lesson learned will not be that our notations of time need to be precise and include their context, unless the general public stops refusing to be educated in the face of dire experience. That so much attention has been granted this silly problem is fortunate for those of us who argue against legacy notations of time. However, the inability of most people to deal with issues of such extraordinary importance when they look "most harmless" means that those who do understand them must be inordinately careful in preparing their information such that loss of real information can be minimized.
The basic problem with time is that we need to express both time and place whenever we want to place some event in time and space, yet we tend to assume spatial coordinates even more than we assume temporal coordinates, and in the case of time in ordinary communication, it is simply left out entirely. Despite the existence of time zones and strange daylight saving time regimes around the world, most people are blithely unaware of their own time zone and certainly of how it relates to standard references. Most people are equally unaware that by choosing a notation that is close to the spoken or written expression of dates, they make it meaningless to people who may not share the culture, but can still read the language. It is unlikely that people will change enough to put these issues to rest, so responsible computer people need to address the issues and resist the otherwise overpowering urge to abbreviate and drop context.
This paper is almost all about how we got ourselves into trouble by neglecting to think about time frames longer than a human lifetime, how we got all confused by the difference between time as an orderly concept in science and a mess in the rest of human existence, and how we have missed every opportunity to fix the problems. This paper proposes a fix to the most glaring problems in a programming language that should not have been left without a means to express time for so long.
Scientific time also lends itself to ease of computation; after all, that is what we do with it. For instance, we have a world-wide standard for time, called the Coordinated Universal Time, or UTC. (The C used to be written as a subscript, just like the digits in UT0 and UT1, which are universal time concepts with slightly different reference points, but "UTC" has become the preferred form.) Scientific time naturally has origin 0, as usual with scientific measures, even though the rest of human time notations tend to have origin 1, the problems of which will be treated below.
Most computer-related references to time deal with periods of time, which lend themselves naturally to use scientific time, and therefore, it makes sense to most programmers to treat the period of time from some epoch until some other time to be the best way to express said other time. This is the path taken by Common Lisp in its UNIVERSAL-TIME concept, with time 0 equal to 1900-01-01 00:00:00 UTC, and the Unix time concept, with time 0 equal to 1970-01-01 00:00:00 UTC. This approach works well as long as the rules for converting between relative and absolute time are stable. As it turns out, they are not.
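A minimal sketch of converting between these two epochs, using only the standard Common Lisp functions; the difference is a fixed 2,208,988,800 seconds, since neither count includes leap seconds. The helper names are illustrative, not part of any library:

(defconstant +unix-epoch+
  (encode-universal-time 0 0 0 1 1 1970 0))   ; 1970-01-01 00:00:00 UTC => 2208988800

(defun universal-to-unix (universal-time)
  "Convert a Common Lisp universal time to a Unix time."
  (- universal-time +unix-epoch+))

(defun unix-to-universal (unix-time)
  "Convert a Unix time to a Common Lisp universal time."
  (+ unix-time +unix-epoch+))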
Not all languages and operating systems use this sensible an approach.
Some have used local time as the point of reference, some use decoded local time
as the reference, and some use hardware clocks that try to maintain time
suitable for direct human consumption. There is no need to make this issue
more complex than it already is, so they will not be granted any importance.
Political time is closely related to territory, power, and collective human irrationality. There is no way you can know from your location alone which time zone applies at some particular point on the face of the earth: you have to ask the people who live there what they have decided. This is very different from scientific time, which could tell you with great ease and precision what the mean sidereal time at some location should be. In some locations, this is as much as three hours off from what the local population has decided, or has had decided for them. The Sun is at its zenith at noon at very few places on earth, instead being eclipsed or delayed by political decisions where the randomness never ends.
Yet, it is this political time that most people want their computers to produce when they ask for the date or the time of day, so software will have to comply with the randomness and produce results consistent with political decisions. The amount of human input into this process is very high, but that is the price we have to pay for our willingness to let politicians dictate the time. However, once the human input has been provided, it is annoying to find that most programming languages and supporting systems do not work with more than one timezone at a time, and consequently do not retain timezone information with time data.
The languages we use tend to shape the ideas we can talk about. So,
too, the ways we write dates and times influence our concepts of time, as they
were themselves influenced by the way somebody thought about time a long time
ago. Calendars and decisions like which year is the first, when the year
starts, and how to deal with astronomical irregularities were made so long ago
that the rationale for them has not survived in any form, but we can still look
at what we have and try to understand. In solving the problem of dealing
with time in computers, a solid knowledge of the legacy we are attending to is
required.
Not only do we omit information that is deemed redundant, it is not uncommon for people to omit information out of sheer laziness. A particularly flagrant example of the omission of information relative to the current time is the output from the Unix ls program which lists various information about files. The customary date and time format in this program is either month-day-hour-minute or month-day-year. The cutoff for tolerable precision is six months ago, which most implementations approximate with 180 days. This reduction in precision appears to have been motivated by horizontal space requirements, a necessary move after wasting a lot of space on irrelevant information, but for some reason, precision in time always suffers when people are short of space.
The infamous Y2K problem, for instance, is said to have started when people wanted to save two columns on punched cards, but there is strong evidence of other, much better alternatives at the time, so the decision to lose the century was not predicated on the need for space, but rather on the culturally acceptable loss of information from time coordinates. The details of this mess are sufficiently involved to fill a separate paper, so the conclusion that time loses precision first when in need or perceived need of space should be considered supported by the evidence.
Using names for numeric entities complicates processing a natural language specification of time tremendously, yet this is what people seem more comfortable with. In some cultures, months have only names, while in others, they are nearly always written as numbers. The way the names of months and the days of the week are abbreviated varies from language to language, as well, so software that wants to be international needs to maintain a large repository of names and notations to cater to the vanity of human users. However, the names are not the worst we have to deal with in natural language notations.
Because dates and times are frequently spoken and because the written forms are often modeled after the spoken, we run into the problem of ordering the elements of time, and the omission of perceived redundancy becomes a much more serious problem, because each language and each culture have handled these problems so differently. The orders in use for dates are month-day-year, day-month-year, and year-month-day, varying with language and culture.
Time is fortunately specified with a uniform hour-minute-second order, but the assumption of either AM or PM even in cultures where there is no custom for their specification provides us with an ambiguity that computers are ill equipped to deal with. This and other historic randomness will be treated in full below.
Most of the time people refer to is in their immediate vicinity, and any system intended to capture human-friendly time specifications will need to understand relative times, such as "yesterday", "this time tomorrow", "two hours ago", "in fifteen minutes". All of these forms vary considerably from culture to culture and from language to language, making the process of reading these forms as input non-trivial. The common forms of expression for periods of time are also fuzzy in human communication, with units that fail to convert to intervals of fixed length, but instead are even more context-sensitive than simple points in time.
Obviously, a language-neutral notation will have to consist of standardized elements and possibly codes. Fortunately, a standard like this already exists: ISO 8601. Since all the work with a good language-neutral notation has already been done, it would be counter-productive in the extreme to reinvent one. However, ISO 8601 is fairly expensive from the appropriate sources and also chock full of weird options, like most compromise standards, so in the interest of solving some problems with its use, only the extended format of this standard will be employed in this paper.
A language-neutral notation will need to satisfy most, if not all, of the needs satisfied by natural language notations, but some latitude is necessary when dealing with relative times -- after all, the purpose of the language-neutral notation is to remove ambiguity and make assumptions more if not completely explicit. ISO 8601 is sufficient to cover these needs:
The full, extended format of ISO 8601 is as follows:
1999-10-11T11:10:30,5-07:00

The elements are, in order: the year (1999), the month (10), and the day (11), separated by hyphens; the letter T as a separator; the hour (11), the minute (10), and the second with an optional decimal fraction (30,5), separated by colons; and the offset from UTC (-07:00).
Every element in the time specification needs to be within the normal bounds. There is no special consideration for leap seconds, although some might want to express them using this standard.
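As a rough sketch of how the extended format can be produced with standard Common Lisp alone, restricted to UTC for simplicity; ISO-8601-UTC-STRING is a hypothetical helper for illustration, not part of the LOCAL-TIME interface:

(defun iso-8601-utc-string (universal-time)
  "Render UNIVERSAL-TIME in the ISO 8601 extended format, in UTC."
  (multiple-value-bind (sec min hour day month year)
      (decode-universal-time universal-time 0)
    (format nil "~4,'0D-~2,'0D-~2,'0DT~2,'0D:~2,'0D:~2,'0D+00:00"
            year month day hour min sec)))

;; (iso-8601-utc-string (encode-universal-time 30 10 11 11 10 1999 0))
;; => "1999-10-11T11:10:30+00:00"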
A duration of time has a separate notation entirely, as follows:
P1Y2M3DT4H5M6S

The elements are, in order: the letter P as a duration designator; the number of years (1Y), months (2M), and days (3D); the letter T as a separator; and the number of hours (4H), minutes (5M), and seconds (6S).
A duration may also be expressed as a whole number of weeks, as in P7W.
A period of time is indicated by two time specifications, at least one of which has to be absolute, separated by a single solidus (slash), and has the general forms as follows:
start/end
start/duration
duration/end

The end form may have elements of the date omitted from the left, with the assumption that the default is the corresponding value of the element from the start form. Omissions in the start form follow the normal rules.
The standard also has specifications for weeks of the year and days of the week, but these are used so rarely and are so aesthetically displeasing that they are gracefully elided from this presentation.
When discussing the read/write syntax of the LOCAL-TIME concept below, the above formats will be employed
with very minor modifications and extensions.
"This page was updated 7/10/99 2:00 AM."

This piece of information is amazingly useless, yet obviously not so to the person who knows where the machine is located and who wrote it in the first place. Only by monitoring for changes to this statement does it have any value at all. Specifications of time often have this purpose, but the belief that they carry information, too, is quite prevalent. The only thing we know about this time specification is that it was made in the past, which may remove most of the ambiguity, but not quite all -- it could be 1999-07-10.
The geographical origin of a time specification is in practice necessary to understand it. Even with the standard notation described above, people will want to know the location of the time. Unfortunately, there is no widely adopted standard for geographical locations. Those equipped with GPS units may use ICBM or grid coordinates, but this is almost as devoid of meaning as raw IP addresses on the Internet. Above all, geography is even more rife with names and naming rules that suffer from translation than any other information that cries for a precise standard.
Time zones therefore double as indicators of geographical location, much to the chagrin of anyone who is not from the same location, because they use names and abbreviations of names with local meaning. Of course. Also, the indications of daylight saving time in the timezone are rather amusing in the probably unintentional complexity they introduce. For instance, the Middle or Central European Time can be abbreviated MET or CET, but the "summer time", as it is called here, is one of MEST, CEST, MET DST, or CET DST. Add to this that the "S for summer" in the former two choices is often translated, and then we have the French.
The only good thing about geography is that most names can be translated into
geographical coordinates, and a mapping from coordinates to time zone and
daylight saving time rules is fairly easy to collect, but moderately difficult
to maintain. This work has been done, however, and is distributed with
most Unix systems these days, most notably the free ones, for some value of
"free". In order for a complete time representation to work fully with its
environment, access to this information is necessary. The work on the
LOCAL-TIME concept includes an interface to the
various databases available under most Unix systems.
When dealing with a particular time, it is therefore necessary to know, or to be told, whether it refers to the past or the future, and whether the vantage point is different from the present. If, for instance, a delivery is due "10/15/99", and it fails to be delivered that day, only a computer would assume that it was now due 2099-10-15. Unfortunately, there is no common practice in this area at all, and most people are satisfied with a tacit assumption. That is in large part what caused the Y2K problem to become so enormously expensive to fix. Had the assumed, but now missing information been available, the kinds of upgrades required would have been different, and most likely much less expensive.
There is more to the perspective than just past and future, however. Most computer applications that are concerned with time are so with only one particular time: the present. We all expect a log file to be generated along with the events, and that it would be disastrous if the computer somehow recorded a different time than the time at which an event occurred, or came back to us and revised its testimony because it suddenly remembered it better. Modern society is disproportionately dependent on a common and coordinated concept of the present time, and we have increasingly let computers take care of this perspective for us. Telephones and computers, both voice and electronic radio broadcasts, watches, wall clocks, the trusty old time clocks in factories where the workers depended on their accuracy, they all portray this common concept of a coordinated understanding of which time it is. And they all disagree slightly. A reportedly Swiss saying goes: "A man with one clock knows the time. A man with two clocks does not."
Among the many unsolved problems facing society is an infrastructure for
time-keeping that goes beyond individual, uncoordinated providers, and a
time-keeping technology that actually works accurately and is so widely
available that the differences in opinion over what time it is can be resolved
authoritatively. The technology is actually here and the infrastructure is
almost available to everyone, but it is not used by the multitude of purported
sources of the current time. On the Internet, NTP (the Network Time Protocol) keeps fully connected
systems in sync, and most telecommunications and energy providers have amazingly
accurate clocks, but mere mortals are still left with alarming
inaccuracies. This fact alone has a tendency to reduce the interest in
accurate representation of time, for the obvious reason that the more accurate
the notation and representation, the less trustworthy the value expressed.
Less obvious is the problem of adding one day to a particular time of day. This was the original problem that spurred the development of the LOCAL-TIME concept and its implementation. In brief, the problem is to determine on which two days of the year the day is not 24 hours long. One good solution is to assume the day is 24 hours long and see if the new time has a different timezone than the original time. If so, add the difference between the timezones to the internal time. This, however, is not the trivial task it sounds like it should be.
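A minimal sketch of this approach, ignoring the complications discussed next and using only the standard time functions; ADD-ONE-DAY is a hypothetical helper, not the LOCAL-TIME implementation:

(defun add-one-day (universal-time)
  "Add one calendar day to UNIVERSAL-TIME, preserving the local wall-clock time."
  (flet ((offset-west (time)
           ;; Effective offset in hours west of UTC, including daylight saving time.
           (multiple-value-bind (sec min hour date month year dow daylight-p zone)
               (decode-universal-time time)
             (declare (ignore sec min hour date month year dow))
             (if daylight-p (1- zone) zone))))
    (let* ((guess (+ universal-time 86400))   ; assume the day is 24 hours long
           (shift (- (offset-west guess)      ; non-zero if a timezone boundary was straddled
                     (offset-west universal-time))))
      (+ guess (* 3600 shift)))))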
The first complication is that none of the usual time functions can report the absolute time at which some timezone identifier will cause a change in the value of the timezone as applicable to the time of day. Resolving this complication would mean that we do not have to test for a straddled timezone boundary the hard way with every calculation, but could just compare with the edge of the current timezone. Most software currently does this the hard way, including the Unix "cron" scheduler. However, if we accept the limitation that we can work with only one timezone at a time, this becomes much less of a problem, so Unix and C people tend to ignore it.
The second complication is that there really is no way around working with an internal time representation in any calculation -- attempts to adjust elements of a decoded time generally fail, not only because programmers are forgetful, but also because the boundary conditions are hard to enumerate.
Most often, however, calculations fall into two mutually exclusive categories: those that deal only with whole days and dates, and those that deal with the time of day within a single day.
The Roman tradition of starting the year in the month of March has also been lost. Most agrarian societies were far more interested in the onset of spring than in the winter solstice, even though various deities were naturally celebrated when the sun returned. Most calendars were designed by people who made no particular effort to be general or accurate outside their own lifetime or needs, but Julius Cæsar decided to move the Roman calendar back two months, and thus it came to be known as the Julian calendar. This means that months number 7, 8, 9, and 10 suddenly came in as numbers 9, 10, 11, and 12, but kept their names: September, October, November, December. This is of interest mostly to those who remember their Latin, but far more important was the decision to retain the leap day in February. In the old calendar, the leap day was added at the end of the year, as makes perfect sense, when the month was already short, but now it is squeezed into the middle of the first quarter, complicating all sorts of calculations, and affecting how much people work. In the old days, the leap day was used as an extra day for the various fertility festivities. You would just have to be a cæsar to find this unappealing.
The Gregorian calendar improved on the quadrennial leap years in the Julian calendar by making only every fourth centennial a leap year, but the decision was unexpectedly wise for a calendar decision. It still is not accurate, so in a few thousand years, they may have to insert an extra leap day the way we introduce leap seconds now, but the simplicity of the scheme is quite amazing: a 400-year cycle not only starts 2000-03-01 (as it did 1600-03-01), it contains an even number of weeks: 20,871. This means that we can make do with a single 400-year calculation for all time within the Gregorian calendar with respect to days of week, leap days, etc. Pope Gregory XIII may well have given a similar paper to this one to another unsuspecting audience that probably also failed to appreciate the elegance of his solution, and 400 more years will pass before it is truly appreciated.
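The arithmetic behind this is easy to verify: 400 years of 365 days plus 97 leap days (one every fourth year, minus the three century years not divisible by 400) gives 146,097 days, which divides evenly by 7. A quick check:

(let ((days (+ (* 400 365) 97)))
  (list days (/ days 7)))
;; => (146097 20871)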
Other than the unexpected elegance of the Gregorian calendar, the world is
now quite fortunate to have reached consensus on its calendars. Other
calendars are still used, but we now have a global reference calendar with
complete convertibility. This is great news for computers. It is
almost as great news as the complete intercurrency convertibility that the
monetary markets achieved only as late as 1992. Before that time, you
could wind up with a different amount of money depending on which currencies you
traded obscure currencies like the ruble through. The same applied to
calendars: not infrequently, you could wind up on different dates depending on how
you converted between calendar systems, similar to the problem of adding a year
to February 29 any year and then subtracting a year.
Bignum operations are generally far more expensive than fixnum operations, and they have to be, regardless of how heavily the Common Lisp implementation has optimized them. There was therefore a pronounced need to work with fixnums in time-intensive applications. The decision fell on splitting between days and seconds, which should require no particular explanation, other than to point out that calculation with days regardless of the time of day is now fully supported and very efficient.
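A quick illustration of the point, assuming a 32-bit implementation of the era, where fixnums have at most 31 bits: current universal times, at roughly 3.1 billion seconds since 1900, no longer fit in a fixnum, so every step of a calculation on raw universal times pays the bignum cost.

(list most-positive-fixnum                    ; e.g. 536870911 with 29-bit fixnums
      (get-universal-time)                    ; roughly 3140000000 in late 1999
      (typep (get-universal-time) 'fixnum))   ; => NIL on such an implementation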
Because we are very close to the beginning of the next 400-year leap-year cycle, thanks to Pope Gregory, day 0 is defined to be 2000-03-01, which is much less arbitrary than other systems, but not obviously so. Each 400-year cycle contains 146,097 days, so an arbitrary decision was made to limit the day to a maximal negative value of -146,097, or 1600-03-01. This can be changed at the peril of accurately representing days that do not belong to the calendar used at the time. No attempt has been made to accurately describe dates not belonging to the Gregorian calendar, as that is an issue resolvable only with reference to the borders between countries and sometimes counties at the many different times throughout history that monarchs, church leaders, or other power figures decided to change to the Gregorian calendar. Catering to such needs is also only necessary with dates prior to the conversion of the Russian calendar to Gregorian, a decision made by Lenin as late as 1918, or any other conversion, such as 1582 in most of Europe, 1752 in the United States, and even more embarrassingly late in Norway.
Not mentioned above is the need for millisecond resolution. Most events on modern computers fall within the same second, so it is now necessary to separate them by increasing the granularity of the clock representation. This part is obviously optional in most time processing functions.
The LOCAL-TIME concept therefore represents time as three disjoint fixnums: the day relative to the epoch, the second within that day, and the millisecond within that second.
The choice of epoch needs some more explanation. Conversion to this system only requires subtracting two from the month and making January and February part of the previous year.
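A sketch of that conversion, as a hypothetical helper rather than the actual LOCAL-TIME code:

(defun internal-month-and-year (month year)
  "Shift the calendar year to start in March, as described above."
  (if (< month 3)
      (values (+ month 10) (1- year))   ; January -> 11, February -> 12 of the previous year
      (values (- month 2) year)))       ; March -> 1, ..., December -> 10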
The moderate size of the fixnums allows us another enormous advantage over
customary ways to represent time. Since the leap day is now always at the
end of the year, it has no bearing on the decoding of the year, month, day, and
day-of-week of the date. By choosing this odd-looking epoch, the entire
problem with computing leap years and days evaporates. This also means
that a single, moderately large table of decoded date elements may be
pre-computed for 400 years, providing a tremendous speed-up over the
division-based calculations used by other systems.
Similarly, a table of the
decoded values of the 86400 possible seconds in a day (86401 if we allow leap
seconds) yields a tremendous speedup over division-based calculations.
(Depending on your processor and memory speeds, a factor of 10 to 50 may be
expected for a complete decoding.)
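A sketch of the second-of-day table, under the assumption that each entry simply holds the decoded hour, minute, and second; the real implementation may well pack these values differently:

(defvar *decoded-seconds*
  (let ((table (make-array 86400)))
    (dotimes (s 86400 table)
      (multiple-value-bind (minutes sec) (floor s 60)
        (multiple-value-bind (hour min) (floor minutes 60)
          (setf (aref table s) (list hour min sec)))))))

(defun decode-second-of-day (s)
  "Look up the decoded (hour minute second) for second S of the day."
  (aref *decoded-seconds* s))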
For the timezone information, the LOCAL-TIME concept implements a package, TZ (TIMEZONE in full), which contains symbols named after the zoneinfo files, whose values are lazily loaded timezone objects. Because the sources for the zoneinfo files are generally not as available as the portably coded binary files, the information is loaded into memory from the compiled files, thus maintaining maximum compatibility with the other timezone functions on the system.
In the LOCAL-TIME instances, the timezone is represented as a symbol to aid in the ability to save literal time objects in compiled Lisp files. The package TZ can easily be autoloaded in systems that support such facilities, in order to reduce the load-order complexity.
In order to increase efficiency substantially once again, each timezone object holds the last few references to timezone periods in it, in order to limit the search time. Empirical studies of long-running systems have shown that more than 98% of the lookups in a given timezone were for times in the same period, with more than 80% of the remaining lookups in the neighboring periods, so caching these values made ample sense.
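A self-contained sketch of the caching idea; the structure and accessor names are assumptions for illustration, not the LOCAL-TIME internals. Each period covers a half-open interval of universal time and carries the offset in effect during that interval:

(defstruct tz-period start end offset)

(defstruct timezone
  (periods '())   ; all periods for this zone, from the zoneinfo data
  (recent  '()))  ; most recently used periods, newest first

(defun timezone-offset-at (timezone time)
  "Return the offset in effect at TIME, consulting the recently used periods first."
  (flet ((contains-p (p)
           (and (<= (tz-period-start p) time) (< time (tz-period-end p)))))
    (let ((hit (find-if #'contains-p (timezone-recent timezone))))
      (unless hit
        (setf hit (find-if #'contains-p (timezone-periods timezone)))
        (when hit (push hit (timezone-recent timezone))))
      (and hit (tz-period-offset hit)))))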
+----------+----+-----+---+    +-----+------+------+
|   yyyy   | mm | day |dow|    |hour | min  | sec  |
+----------+----+-----+---+    +-----+------+------+
     10      4     5    3         5     6      6

This simple optimization meant 7 times more compact storage of the exact same data, with significantly improved access times, to boot (depending on processor and memory speeds as well as considerations for caching strategies, a factor of 1.5 to 3 has been measured in production).
Still, 909K of storage to keep tables of precomputed dates and times may seem
a steep price to pay for the improved performance. Unsurprisingly, more
empirical evidence confirmed that most dates decoded were in the same
century. Worst case over the next few years, we will access two centuries
frequently, but it is still a waste to store four full centuries. A
reduction to 100 years per table also meant that the number of years was
representable in 7 bits, meaning that a specialized vector of type (UNSIGNED-BYTE 16) could represent them all. The day
of week would be lost in this optimization, but a specialized vector of type
(UNSIGNED-BYTE 4) of the full length (146097)
could hold them if a single division to get the day of week was too
expensive. It turns out that the day of week is much less used than the
other decoded elements, so the specialized vector was dropped and an option
included with the call to the decoder to skip the day of week.
Similarly, by
representing only 12 hours in a specialized vector of type (UNSIGNED-BYTE 16), the hour would need only 4 bits and the
lookup could do the 12-hour shift in code. This reduces the table memory
needs to only 156K, and it is still faster than access to the full list
representation. This compaction yields almost a factor of 42 improvement over
the naïve approach.
For completeness, the bit field layout is now simplified as follows.
+-------+----+-----+    +----+------+------+
| 0-100 |1-12| 1-31|    |0-11| 0-59 | 0-59 |
+-------+----+-----+    +----+------+------+
    7     4     5          4     6      6

Decoding the day now means finding the 400-year cycle for the day of week, the century within it for the table lookup, and adding together the values of the centuries and the year from the table, which may be 100 to represent January and February of the following century. All of this can be done with very inexpensive fixnum operations for about 2,939,600 years, after which the day will incur a bignum subtraction to bring it into fixnum space for the next 2,939,600 years. (This optimization has not actually been implemented.)
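A sketch of the packing for the 16-bit date entries described above (7 bits of year within the century, 4 bits of month, 5 bits of day), using the standard byte-field operators; the field positions are assumptions matching the table, not the actual implementation:

(defun pack-date (year-in-century month day)
  (dpb year-in-century (byte 7 9)
       (dpb month (byte 4 5) day)))

(defun unpack-date (packed)
  (values (ldb (byte 7 9) packed)     ; year within the century, 0-100
          (ldb (byte 4 5) packed)     ; month, 1-12
          (ldb (byte 5 0) packed)))   ; day, 1-31

;; (unpack-date (pack-date 99 10 11)) => 99, 10, 11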
Supported formats of the timestring syntax include
The standard syntax from ISO 8601 is fairly rich with options. These are mostly unsupported due to the ambiguity they introduce. The goal with the timestring syntax is that positions and periods of time shall be so easy to read and write in an information-preserving syntax that there will be no need to cater to the information-losing formats preferred by some only because of their attempt at similarity to their spoken forms.
At present, the interface to the timestring formatter is well suited for a call from FORMAT control strings with the ~// construct, and takes arguments as follows:
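As a rough illustration of the ~// calling convention itself (not the actual argument list of the LOCAL-TIME formatter), the named function receives the stream, the argument, the colon and at-sign flags, and any prefix parameters; ISO-DATE below is a hypothetical example, defined in the COMMON-LISP-USER package where ~// looks up unqualified names:

(defun iso-date (stream time &optional colon at-sign &rest params)
  "Write TIME (a universal time) to STREAM as an ISO 8601 date, in UTC."
  (declare (ignore colon at-sign params))
  (multiple-value-bind (sec min hour day month year)
      (decode-universal-time time 0)
    (declare (ignore sec min hour))
    (format stream "~4,'0D-~2,'0D-~2,'0D" year month day)))

;; (format t "Updated ~/iso-date/.~%" (get-universal-time))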
The great guys at Franz Inc have helped with internal details in Allegro CL and have of course made a wonderful Common Lisp environment to begin with. Thanks in particular to Samantha Cichon and Anna McCurdy for taking care of all the details and making my stays so carefree, and to Liliana Avila for putting up with my total lack of respect for deadlines.
Many thanks to Pernille Nylehn for reading and commenting on drafts, nudging
me towards finishing this work, and for taking care of my cat Xyzzy so I could
write this in peace and deliver it at LUGM '99 without worrying about the little
furball's constant craving for attention, but also without both their warmth and
comfort when computers simply refuse to behave rationally.