robertbastian
Member

We should put the cutoff to "historical Chinese" at 1912, as that's when today's algorithm started being used (and it's also more than 100 years ago, a value that was agreed on in #5778).

In the other direction, add hardcoded data until 2125 (100 years in the future).

Member

@sffc sffc left a comment

Let's please just use PMO data which starts in 1900. I certainly don't want Pinqi for dates when anyone currently alive has a birthday.

@sffc
Member

sffc commented Sep 22, 2025

We should put the cutoff to "historical Chinese" at 1912, as that's when today's algorithm started being used (and it's also more than 100 years ago, a value that was agreed on in #5778).

Further down in that thread we revise the "minimum of 100 years" to "what the HKO ships"

@robertbastian
Member Author

Further down in that thread you say you plan to ship HKO data; that does not make it an agreed-upon plan. The year 1900 has no relevance other than that it starts a Gregorian century; we have correct data for 1899 as well.

when anyone currently alive has a birthday

If you want that to be our criterion for including data, please back it up with some evidence that this results in a cutoff at 1900.

@robertbastian robertbastian requested a review from sffc September 22, 2025 12:27
@sffc
Member

sffc commented Sep 22, 2025

The "person alive" criterion would imply 1908 IIRC, which we round to 1900 because that signals that it is a human-picked, arbitrary cutoff.

@robertbastian
Member Author

I can find no evidence that anyone born in China before 1912 is alive today: https://en.wikipedia.org/wiki/List_of_the_oldest_people_by_country

@sffc
Member

sffc commented Sep 22, 2025

Also the PMO tables explicitly mark when pre-1912 diverges, and they impact some minor solar terms but nothing that causes days or months to shift, which I think I posted in one of the issues

@sffc
Member

sffc commented Sep 22, 2025

Also doesn't this unfix the 1906 issue you fixed last week? #6454

Basically our policy as I understand it is that we aim to match ground truth for historical dates going back arbitrarily far, with an implicit statute of limitations when either data gets too big or we can't independently verify the data correctness.

@robertbastian
Member Author

Also doesn't this unfix the 1906 issue you fixed last week? #6454

For all we know that same issue exists in 1899. I still haven't seen an argument why 1900 is a more epochal year for the Chinese calendar than 1912.

@robertbastian
Member Author

with an implicit statute of limitations when either data gets too big or we can't independently verify the data correctness.

"Data getting too big" is an arbitrary cutoff. We can independently verify data before 1900 as well; the PMO published this book, for example: https://www.las.ac.cn/front/book/detail?id=34e0e2b422dd15f2231febddb7d764b8. We can do this arbitrarily far back, e.g. using the sources listed in https://ytliu0.github.io/ChineseCalendar/rules.html.

The fact remains that anything before 1912 does not follow GB/T 33661-2017/Calendrical Calculations, and everything after does. This is the most natural place to draw the line between "historical data" and "current data".

@Manishearth
Member

My position is similar to Shane's, 1900 is a good cutoff and we have published data for it from a trusted source. The argument that we don't actually have people with birthdays before 1912 is compelling, though.

@robertbastian
Member Author

We have published data from a trusted source since at least 1645

@Manishearth
Member

Not the argument I was making.

1900 is a decent cutoff, and we have published data from a trusted source, not because we have published data from a trusted source.

Why I think it is a decent cutoff is roughly along the lines of what Shane said about artificial boundaries and birthdays.

@robertbastian
Member Author

Why I think it is a decent cutoff is roughly along the lines of what Shane said about artificial boundaries and birthdays.

Please elaborate? Shane's birthday argument does not hold water.

1912 is an epochal year for China, with the end of the Qing dynasty and the new calendar rules. 1900 is completely arbitrary because China did not even use the Gregorian calendar then.

@Manishearth
Member

The birthday argument is about getting a minimum, I like adding some padding where possible, and here it is quite possible.

1912 being epochal is somewhat compelling, though.

@robertbastian
Member Author

The "person alive" criterion would imply 1908

Still waiting for a source for this.

@sffc
Member

sffc commented Sep 23, 2025

The "person alive" criterion would imply 1908

Still waiting for a source for this.

Last I checked the oldest person alive was born in 1908 but now it's 1909; still rounds to 1900.

@robertbastian
Member Author

Still no source

@Manishearth
Member

@Manishearth
Member

Amusingly the oldest verified living Chinese person according to that list was born in 1913.

That's probably not a coincidence: this list is of verified people (there are a lot of claims, many of them later revealed to be fake) and I suspect Chinese record-keeping changed with the ROC.

Though the Qing were pretty good at keeping records....

@robertbastian
Member Author

Ethel Caterham

Well she's not Chinese is she

@Manishearth
Member

Right, and that's because this cutoff is a pan-calendar concept (in @sffc's version of it, and mostly mine). This is also why 1900 is a "round number" here: it's a pan-calendar concept, and this is a round number in ISO.

The idea is to have a cutoff where we guarantee pan-calendar ground truth accuracy as much as possible, which I think is generally a good idea because then our entire API surface becomes valid for those dates. It is not challenging to do this; we have the data and we have the algorithms, the only "problem" is that it's a bit weird to be doing something different for a short span of 12 years. Overall I think that weirdness is a small cost to pay, and it'll be moot if we switch to a different approximation pre-1900 anyway.

Individual calendars can, on top of the 1900 cutoff, make guarantees of ground truth accuracy for wider ranges.

@sffc
Member

sffc commented Sep 24, 2025

The year number 1900 just got codified into the Temporal specification.

tc39/proposal-temporal#3152

@robertbastian
Member Author

No. That spec change is about how to find a reference year, i.e. to give 1900-1972 precedence over 1972-2035. That does not matter for this issue, because no Chinese reference year is in the range 1900-1912 (and even if one was, there are already reference years before the modern range anyway).

@sffc
Member

sffc commented Sep 24, 2025

The basis for the Temporal change is: "historical dates might change as more is discovered about how the calendar was used". This is exactly the case with 1906. The ground truth differs from the approximation we use. If we set 1912 as the start date, then dates between 1900 and 1912 are ones we consider subject to change, but the Temporal spec says that we should do that only prior to 1900.

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Sep 25, 2025
@sffc
Member

sffc commented Sep 27, 2025

WG notes:

  • @robertbastian Hardcoded Chinese data is 1900-2100. (a) want to extend 100y so 2125
  • @robertbastian After 1912 this is a performance thing; we match the spec
  • @robertbastian Before 1912 we diverge from ground truth in the algorithm, so hardcoding is actually about correctness
  • @robertbastian It's weird for us to say "we use the right algorithm 1912 onwards, then have hardcoded data, then we are proleptic"
  • @hsivonen How bad would it be to say this is correct from 1912 onwards and is proleptic using the same algorithm before that?
  • @Manishearth We plan to be proleptic with a Pinqi approximation that we haven't completely figured out. I like having ground-truth invariants that, for everyone, from 1900 we match ground truth. It is easy for us to do this. Calendars can extend that range.
  • @Manishearth I view "ground truth validity", "data source", and "algorithm" as different things. I think it's fine for us to say "The Chinese calendar follows Chinese ground truth from 1900 onwards, which is verified against the PMO data, using the GB/T algorithm from 1912 onwards and a different one before that. Outside of this range, we fall back to the GB/T algorithm, but this may change."
  • @sffc Note that when the 1900 cutoff was something we initially thought of there was a Japanese or Chinese person alive. But either way I want to round down.
  • @robertbastian 1912 is when the Qing dynasty fell, it's much more recognizable
  • @sffc 1900 is a globally recognizable year. These are the dates we guarantee globally. This applies to Chinese as using the PMO data from 1900.
  • @sffc The whole point of these calendars is to define ground truth, even if it's to say "lunar ground truth" it is ground truth.
  • @hsivonen what is our UAQ start date?
  • @robertbastian 1882.
  • @hsivonen and Japanese?
  • @robertbastian 1868, Meiji
  • @sffc The thing I care about is a single global year range where all calendars have a well defined ground truth, and 1900 is an excellent choice for that year range. Maybe there are other choices, which we can debate.
  • @Manishearth one cop out is to define ±100y from 2025 as the global range, 1925 is the mandated minimum, and then individual calendars opt in to more range (Chinese would opt in to 1912)
  • @sffc ±100y doesn't cover live-person birthdays. Also, once we choose a start year, we should stick to it.
  • @hsivonen Without taking a position for a global range, observing that the most constraining calendar is UAQ, which depends on an authority.
  • @sffc before I knew this was controversial we landed the referenceYear change to Temporal that codified 1900. That year is derived from the birthday heuristic.
  • @Manishearth I don't think that algorithm is super relevant because MonthDay doesn't interoperate between calendars at all.
  • @sffc (1) I believe there should be a global range. (2) I think aligning it with the birthday heuristic, which is the same heuristic used for the MonthDay reference year algorithm, is a sensible choice.
  • @Manishearth generally find the MonthDay link to be not a big difference here since it doesn't interoperate
  • @sffc it's linked because the referenceYear algorithm should return a year-month-day that isn't subject to implementation-defined approximation
  • @nekevss I'm sympathetic to reference year not dictating the decisions for other things
  • @nekevss Generally don't think that starting dates should move around
  • @Manishearth i think we should be okay with expanding starting dates
  • @nekevss yes!
  • @hsivonen The birthday criterion is useful for informing what to attempt. I think rounding to 1900 is less strongly justified. A bunch of these calendars are "easily proleptic"; for the rest it's somewhat relevant what the adoption start / scheme is. E.g. for Japanese I'm sympathetic to the JDK reasoning ("they start with Jan 1, Meiji 6", which is when the current calendar scheme was adopted) for Meiji rather than Temporal, but they both end up with the same conclusion.
  • @Manishearth yeah, that makes some sense. I do think in the Chinese case the calendar changed within the same shape and that's different from changing calendar system

No conclusion yet.
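
To make the two positions in these notes easier to compare, here is a rough sketch of how a year would be handled under each set of cutoffs. Everything in it is an assumption for illustration: `YearSource`, `Cutoffs`, `classify`, `STATUS_QUO` and `PROPOSED` are hypothetical names and not part of the ICU4X API, and the bounds are treated as inclusive, which the thread does not specify.

```rust
/// How a Chinese calendar year would be produced (hypothetical names, not ICU4X API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum YearSource {
    /// Before the hardcoded range: a proleptic approximation that may change.
    ProlepticApproximation,
    /// Hardcoded, and needed for correctness: the GB/T algorithm diverges from ground truth.
    HardcodedForCorrectness,
    /// Hardcoded, but only as a performance shortcut: the GB/T algorithm gives the same result.
    HardcodedForPerformance,
    /// After the hardcoded range: computed with the GB/T algorithm.
    ComputedByAlgorithm,
}

/// Inclusive range of years covered by hardcoded data (inclusivity is an assumption).
#[derive(Debug, Clone, Copy)]
struct Cutoffs {
    hardcoded_start: i32,
    hardcoded_end: i32,
}

/// First year in which, per the thread, ground truth follows GB/T 33661-2017.
const GBT_START: i32 = 1912;

/// Status quo described in the notes: hardcoded data for 1900-2100.
const STATUS_QUO: Cutoffs = Cutoffs { hardcoded_start: 1900, hardcoded_end: 2100 };

/// Proposal in this PR: start at 1912, extend to 2125.
const PROPOSED: Cutoffs = Cutoffs { hardcoded_start: 1912, hardcoded_end: 2125 };

fn classify(year: i32, cutoffs: Cutoffs) -> YearSource {
    if year < cutoffs.hardcoded_start {
        YearSource::ProlepticApproximation
    } else if year > cutoffs.hardcoded_end {
        YearSource::ComputedByAlgorithm
    } else if year < GBT_START {
        YearSource::HardcodedForCorrectness
    } else {
        YearSource::HardcodedForPerformance
    }
}
```

Under these assumptions, `classify(1906, STATUS_QUO)` is `HardcodedForCorrectness` while `classify(1906, PROPOSED)` is `ProlepticApproximation`, which is the 1906 concern raised earlier in the thread. With `PROPOSED`, the `HardcodedForCorrectness` case never occurs.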

@sffc
Member

sffc commented Sep 27, 2025

@Manishearth made a point yesterday afternoon that changed my perspective a bit.

My hardline has been interop between Temporal implementations. This is what has been driving the Intl Era Month Code proposal, the strict rules for Reference Year, etc.

What @Manishearth pointed out is that dates of birth (PlainDates as opposed to MonthDays) are themselves not defined by the calendar but are rather defined as a specific Rata Die. Therefore, serializing a date of birth to an IXDTF string and parsing it in another implementation is guaranteed to work, regardless of which set of Chinese approximations an engine happens to be using.

Therefore, if an engine has the wrong mapping between ISO and a calendar for someone's birthday, it is not an interop bug; it is an engine-specific user bug.
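
A minimal sketch of that argument, assuming a Temporal-style string of the form `YYYY-MM-DD[u-ca=...]`: the serialized form carries the ISO date, so parsing it never consults an engine's Chinese calendar tables. The `PlainDate` struct and the `to_ixdtf` / `from_ixdtf` helpers are hypothetical and simplified (positive four-digit years only); they are not the Temporal or ICU4X API.

```rust
/// A date pinned to a fixed day via its ISO fields, plus the calendar it is displayed in.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PlainDate {
    iso_year: i32,
    iso_month: u8,
    iso_day: u8,
    calendar: String,
}

/// Serialize as an ISO date followed by a calendar annotation.
fn to_ixdtf(date: &PlainDate) -> String {
    format!(
        "{:04}-{:02}-{:02}[u-ca={}]",
        date.iso_year, date.iso_month, date.iso_day, date.calendar
    )
}

/// Parse the simplified form produced by `to_ixdtf`.
/// No calendar arithmetic happens here: only the ISO fields and the annotation are read.
fn from_ixdtf(s: &str) -> Option<PlainDate> {
    let (iso, rest) = s.split_once("[u-ca=")?;
    let calendar = rest.strip_suffix(']')?;
    let mut parts = iso.split('-');
    let iso_year: i32 = parts.next()?.parse().ok()?;
    let iso_month: u8 = parts.next()?.parse().ok()?;
    let iso_day: u8 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // trailing components are not part of this simplified format
    }
    Some(PlainDate {
        iso_year,
        iso_month,
        iso_day,
        calendar: calendar.to_string(),
    })
}
```

Two engines that disagree on the Chinese month and day for a given 1906 date would still exchange the same `PlainDate` through this string; the disagreement only shows up when each engine derives calendar fields for display.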

To be clear, an engine-specific user bug is still one that I care a lot about. But, it's one where I'm more amenable to considering calendar-specific solutions instead of global solutions. In other words, I still prefer the ICU4X Chinese calendar to match PMO ground truth through 1900, but it's more because "well the data is there and it costs only a few extra bytes and then we fully align with PMO and can move on" and not "we need to enforce 1900 due to interop concerns".


The place where interop concerns still come up is if any MonthDay reference dates land in this range. We happen to be lucky and they don't. However, if they did, then this would be a hard objection from me at least until we updated the Temporal spec to use different reference years.

@robertbastian
Member Author

robertbastian commented Sep 29, 2025

Can you suggest phrasing for the docs of China for the status quo (which you're arguing for)? Note that the current docs are not accurate for the range 1901-1912

@sffc
Member

sffc commented Sep 30, 2025

Sure. Something like,

ICU4X implements calendar rules according to GB/T 33661-2017 from 1912 through 2100. Outside that range, ICU4X attempts to match ground truth where there is unambiguous, high-quality data, and falls back to an arbitrary proleptic approximation for distant dates.

robertbastian pushed a commit that referenced this pull request Oct 1, 2025
The code and data used for fetching this will be pushed up to a separate
(private) Unicode repo once we have one. You can find the cleaned up
source data in
https://gist.github.com/Manishearth/d8c94a7df22a9eacefc4472a5805322e.

I'm imagining that post-1950 data will change or be removed with
#7006



The initial motivation here was to fix the apparent ground truth
mismatch found in
https://github.com/unicode-org/icu4x/pull/7007/files#r2393049682. Turns
out it was a different problem, and it has been fixed in
#7013.

We may potentially need the same discussion as #6970 about whether we
care about these pre-1912 dates, since that's the only time this
diverges.
@robertbastian robertbastian deleted the cdata branch October 1, 2025 10:55