Change range for Chinese data to 1912-2125 #6970
a748c5b to 1068488
Let's please just use PMO data which starts in 1900. I certainly don't want Pinqi for dates when anyone currently alive has a birthday.
Further down in that thread we revise the "minimum of 100 years" to "what the HKO ships" |
Further down in that thread you say you plan to ship HKO data; that does not make it an agreed-upon plan. There is no relevance to the year 1900 other than that it starts a Gregorian century, we have correct data for 1899 as well.
If you want that to be our criterion for including data, please back it up with some evidence that this results in a cutoff at 1900. |
The "person alive" criterion would imply 1908 IIRC, which we round to 1900 because that signals it is a human-picked, arbitrary cutoff. |
I can find no evidence that anyone born in China before 1912 is alive today: https://en.wikipedia.org/wiki/List_of_the_oldest_people_by_country |
Also the PMO tables explicitly mark when pre-1912 diverges, and they impact some minor solar terms but nothing that causes days or months to shift, which I think I posted in one of the issues |
Also doesn't this unfix the 1906 issue you fixed last week? #6454 Basically our policy as I understand it is that we aim to match ground truth for historical dates going back arbitrarily far, with an implicit statute of limitations when either data gets too big or we can't independently verify the data correctness. |
For all we know that same issue exists in 1899. I still haven't seen an argument why 1900 is a more epochal year for the Chinese calendar than 1912. |
"Data getting too big" is an arbitrary cutoff. We can independently verify data before 1900 as well, the PMO published this book for example: https://www.las.ac.cn/front/book/detail?id=34e0e2b422dd15f2231febddb7d764b8. We can do this arbitrarily far back, e.g. using the sources listed in https://ytliu0.github.io/ChineseCalendar/rules.html. The fact remains that anything before 1912 does not follow GB/T 33661-2017/Calendrical Calculations, and everything after does. This is the most natural place to draw the line between "historical data" and "current data". |
My position is similar to Shane's, 1900 is a good cutoff and we have published data for it from a trusted source. The argument that we don't actually have people with birthdays before 1912 is compelling, though. |
We have published data from a trusted source since at least 1645 |
Not the argument I was making. 1900 is a decent cutoff, and we have published data from a trusted source, not because we have published data from a trusted source. Why I think it is a decent cutoff is roughly along the lines of what Shane said about artificial boundaries and birthdays. |
Please elaborate? Shane's birthday argument does not hold water. 1912 is an epochal year for China, with the end of the Qing dynasty and the new Calendar rules. 1900 is completely arbitrary because China did not even use the Gregorian calendar then. |
The birthday argument is about getting a minimum, I like adding some padding where possible, and here it is quite possible. 1912 being epochal is somewhat compelling, though. |
Still waiting for a source for this. |
Last I checked the oldest person alive was born in 1908 but now it's 1909; still rounds to 1900. |
Still no source |
Amusingly, the oldest verified living Chinese person according to that list was born in 1913. That's probably not a coincidence: this list is of verified people (there are a lot of claims, many of them later revealed to be fake), and I suspect Chinese record-keeping changed with the ROC. Though the Qing were pretty good at keeping records.... |
Well, she's not Chinese, is she? |
Right, and that's because this cutoff is a pan-calendar concept (in @sffc's version of it, and mostly mine). This is also why 1900 is a "round number" here: it's a pan-calendar concept, and this is a round number in ISO. The idea is to have a cutoff where we guarantee pan-calendar ground truth accuracy as much as possible, which I think is generally a good idea because then our entire API surface becomes valid for those dates. It is not challenging to do this; we have the data and we have the algorithms. The only "problem" is that it's a bit weird to be doing something different for a short span of 12 years. Overall I think that weirdness is a small cost to pay, and it'll be moot if we switch to a different approximation pre-1900 anyway. Individual calendars can, on top of the 1900 cutoff, make guarantees of ground truth accuracy for wider ranges. |
The year number 1900 just got codified into the Temporal specification. |
No. That spec change is about how to find a reference year, i.e. to give 1900-1972 precedence over 1972-2035. That does not matter for this issue, because no Chinese reference year is in the range 1900-1912 (and even if one was, there are already reference years before the modern range anyway). |
The basis for the Temporal change is: "historical dates might change as more is discovered about how the calendar was used". This is exactly the case with 1906. The ground truth differs from the approximation we use. If we set 1912 as the start date, then dates between 1900 and 1912 are ones we consider subject to change, but the Temporal spec says that we should do that only prior to 1900. |
WG notes:
No conclusion yet. |
@Manishearth made a point yesterday afternoon that changed my perspective a bit. My hard line has been interop between Temporal implementations. This is what has been driving the Intl Era Month Code proposal, the strict rules for Reference Year, etc.

What @Manishearth pointed out is that dates of birth (PlainDates as opposed to MonthDays) are themselves not defined by the calendar but are rather defined as a specific Rata Die. Therefore, serializing a date of birth to an IXDTF string and parsing it in another implementation is guaranteed to work, regardless of which set of Chinese approximations an engine happens to be using. Consequently, if an engine has the wrong mapping between ISO and a calendar for someone's birthday, it is not an interop bug; it is an engine-specific user bug.

To be clear, an engine-specific user bug is still one that I care a lot about. But it's one where I'm more amenable to considering calendar-specific solutions instead of global solutions. In other words, I still prefer the ICU4X Chinese calendar to match PMO ground truth through 1900, but it's more because "well the data is there and it costs only a few extra bytes and then we fully align with PMO and can move on" and not "we need to enforce 1900 due to interop concerns".

The place where interop concerns still come up is if any MonthDay reference dates land in this range. We happen to be lucky and they don't. However, if they did, then this would be a hard objection from me at least until we updated the Temporal spec to use different reference years. |
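The interop point above can be illustrated with a minimal Rust sketch. This is not ICU4X or Temporal code: the function names are hypothetical, the parser only handles the simple `YYYY-MM-DD[u-ca=...]` shape with a positive year, and the sample date used in the note below is arbitrary. It shows that the fixed day number (Rata Die) is derived from the ISO part of the string alone, so the calendar annotation cannot change which day is meant.

```rust
/// Rata Die of a proleptic Gregorian (ISO) date, with 0001-01-01 as day 1.
/// Uses the well-known "days from civil" arithmetic with a March-based year.
fn iso_to_rata_die(year: i64, month: i64, day: i64) -> i64 {
    // Treat January and February as the end of the previous year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400);
    let month_from_march = (month + 9) % 12; // March = 0, February = 11
    let day_of_year = (153 * month_from_march + 2) / 5 + day - 1;
    let day_of_era = year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    // 719468 shifts to days since 1970-01-01; 719163 is the Rata Die of 1970-01-01.
    era * 146_097 + day_of_era - 719_468 + 719_163
}

/// Hypothetical, deliberately tiny parser: splits "YYYY-MM-DD[u-ca=name]"
/// into the ISO fields and the optional calendar annotation.
/// Assumes a positive year and at most one annotation.
fn parse_ixdtf_date(s: &str) -> Option<(i64, i64, i64, Option<&str>)> {
    let (date, rest) = match s.find('[') {
        Some(i) => (&s[..i], &s[i..]),
        None => (s, ""),
    };
    let mut parts = date.split('-');
    let year = parts.next()?.parse().ok()?;
    let month = parts.next()?.parse().ok()?;
    let day = parts.next()?.parse().ok()?;
    let calendar = rest
        .strip_prefix("[u-ca=")
        .and_then(|r| r.strip_suffix(']'));
    Some((year, month, day, calendar))
}
```

With these helpers, `"1906-04-23[u-ca=chinese]"` and `"1906-04-23[u-ca=gregory]"` parse to the same ISO fields and therefore the same Rata Die. Only the display of that day in the Chinese calendar depends on the engine's data or approximation.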
Can you suggest phrasing for the docs of |
Sure. Something like,
|
The code and data used for fetching this will be pushed up to a separate (private) Unicode repo once we have one. You can find the cleaned up source data in https://gist.github.com/Manishearth/d8c94a7df22a9eacefc4472a5805322e. I'm imagining that post-1950 data will change or be removed with #7006.

The initial motivation here was to fix the apparent ground truth mismatch found in https://github.com/unicode-org/icu4x/pull/7007/files#r2393049682. Turns out it was a different problem, and it has been fixed in #7013.

We may potentially need the same discussion as #6970 about whether we care about these pre-1912 dates, since that's the only time this diverges.
We should set the cutoff for "historical Chinese" at 1912, as that's when today's algorithm started being used (and it's also more than 100 years ago, a value that was agreed on in #5778).
In the other direction, add hardcoded data until 2125 (100 years in the future).
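The proposed range can be sketched as a simple year dispatch. The following Rust snippet is only an illustration under stated assumptions, not the actual ICU4X implementation: `YearSource`, `year_source`, and the constant names are hypothetical, and treating 2125 as an inclusive upper bound is an assumption based on the "1912-2125" wording in the title.

```rust
/// Hypothetical marker for where a Chinese calendar year's layout comes from.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum YearSource {
    /// Looked up in hardcoded (precomputed) data.
    Precomputed,
    /// Computed at runtime: "historical" before the range, projected after it.
    Computed,
}

/// Bounds taken from the PR title; the inclusive upper bound is an assumption.
const FIRST_PRECOMPUTED_YEAR: i32 = 1912;
const LAST_PRECOMPUTED_YEAR: i32 = 2125;

/// Decide which source would serve the given related ISO year.
fn year_source(related_iso_year: i32) -> YearSource {
    if (FIRST_PRECOMPUTED_YEAR..=LAST_PRECOMPUTED_YEAR).contains(&related_iso_year) {
        YearSource::Precomputed
    } else {
        YearSource::Computed
    }
}
```

Under this sketch, a year such as 1906 falls outside the hardcoded range and would be served by whatever computation is used for historical dates, which is exactly the case the discussion above is about.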