Add tool reporting outdated l10n documents by lastmod difference #45844
/area localization
for file in files:
    if file.endswith(file_extension):
        full_path = os.path.join(root, file)
        last_modified_time = os.path.getmtime(full_path)
If we can look up the last modification time from Git, especially for a source file, I think that helps even more.
I found this Stack Overflow thread:
'How to get the last modification date of a file in a git repo?'
The correct way to do this is to use git log as follows.
git log -1 --pretty="format:%ci" /path/to/repo/anyfile.any
-1 restricts it to the very last time the file changed.
%ci is just one of the date formats; you can choose from the others listed at https://git-scm.com/docs/pretty-formats
To do this in Python code, we might
- call a system command, or
- use a Python package that provides Git-related functionalities.
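As a rough sketch of the first option (calling a system command), the git log invocation quoted above could be wrapped with subprocess. The helper names parse_git_date and git_lastmod are hypothetical and not part of this PR, and the sketch assumes git is available on PATH and that repo_dir is inside a Git working tree.

```python
import subprocess
from datetime import datetime
from typing import Optional

# Format produced by `git log --pretty=format:%ci`, e.g. "2024-03-06 14:22:10 +0900".
GIT_CI_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def parse_git_date(text: str) -> datetime:
    """Parse a %ci committer date into a timezone-aware datetime."""
    return datetime.strptime(text.strip(), GIT_CI_FORMAT)


def git_lastmod(path: str, repo_dir: str = ".") -> Optional[datetime]:
    """Return the date of the last commit touching `path`, or None if it has no commits.

    Hypothetical helper; assumes `git` is on PATH.
    """
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=format:%ci", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError outside a Git repository
    )
    return parse_git_date(result.stdout) if result.stdout.strip() else None
```

Swapping %ci for %ai in the command would return the author date instead of the committer date; the two can differ after a rebase or cherry-pick.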
Thanks! I think getting the last modification time from Git is the better way.
However, I'd first like to check how Hugo determines it in our partial layouts/partials/docs/outdated_content.html, because I want the website notification and this script's result to be the same. Let me check it first :)
- https://github.com/morix1500/website/blob/4fa23bc60451d609f10ba2e1d79a34af5be5d904/layouts/partials/docs/outdated_content.html
@sftim @jihoon-seo
It seems https://gohugo.io/methods/page/lastmod/ utilizes Git :)
Set the last modification date to the Author Date of the last Git commit for that file. See GitInfo for details.
I will update the script to utilize Git to get lastmod. Thanks!
Hi @sftim
I've applied your suggestion. Thanks!
The outcome of lastmod is the same as with the previous method (using os.path.getmtime). So, I think we don't have to change the other sections regarding this matter.
based on modification date differences.

It focuses primarily on:
1. Reporting outdated documents based on modification date differences.
It would be great if these statistics could also be published somewhere on the web periodically so that people could consult the progress status and be motivated to catch up. Maybe not the scope of this PR but an idea ;)
Thank you, @ricardoamaro.
Yes, if there is a general agreement among the localization teams, we could potentially automate this program to run periodically (using GitHub workflows, etc.).
Once this PR is merged, I plan to introduce this tool to the localization contributors and test its usefulness.
Signed-off-by: Seokho Son <shsongist@gmail.com>
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Hi @divya-mohan0209 @reylejano @natalisucks
Since we have a much simpler script for this, i.e.
If the new tool provides a benefit to a localization team, I think it's welcome, because we support localization teams in picking a workflow that works for them. It's also OK to combine the
I understand that
Although it is possible to merge it with an existing script like
I like the idea of this, but I'm not in any localization team. Also see #48163
/remove-lifecycle rotten
@seokho-son I've not LGTMed or approved this because:
I recommend asking localization teams to try it out and comment.
As a member of the Chinese localization team, I tested this script, but I found that its output is not consistent with that of the existing script ./scripts/lsync.sh (= git log -1 --pretty="format:%ci" /path/to/repo/anyfile.any).
Take the file ./content/zh-cn/docs/tasks/debug/debug-cluster/audit.md as an example (based on the latest main branch):
This is the result of running the script ./scripts/report-outdated-by-mod.py (Line(D):-24, Words(D):36):
python ./scripts/report-outdated-by-mod.py zh-cn
This is the result of running the script ./scripts/lsync.sh (3 insertions(+), 3 deletions(-)):
Perhaps the difference in line numbers is due to different calculation methods, but this confuses me. Could you explain this difference?
full_path = os.path.join(path, target_lang, file)
mismatched_special, mismatched_english_words = calculate_similarity(os.path.join(dir1, file), os.path.join(dir2, file))

last_mod_colored = colored_text(f"{days} days", 91) if days > 30 else f"{days} days"
Could the meaning of the different warning colors be documented in README.md, so that future contributors and users can understand the output more quickly?
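A minimal sketch of what the color codes in this diff would mean, assuming colored_text wraps its argument in an ANSI SGR escape sequence (its actual implementation is not shown in this excerpt). In the standard ANSI palette, 91 is bright red foreground and 94 is bright blue foreground. The constant names and format_days are hypothetical.

```python
def colored_text(text, color_code):
    """Wrap text in an ANSI SGR color sequence (assumed behavior, not the PR's code)."""
    return f"\033[{color_code}m{text}\033[0m"


BRIGHT_RED = 91   # the diff uses 91 for documents more than 30 days behind
BRIGHT_BLUE = 94  # the diff uses 94 for an English-word difference below 10


def format_days(days):
    """Mirror the rule in the diff: highlight lags longer than 30 days in red."""
    label = f"{days} days"
    return colored_text(label, BRIGHT_RED) if days > 30 else label


print(format_days(45))  # shown in bright red on ANSI-capable terminals
print(format_days(7))   # shown without color
```

A short legend of this kind (red: more than 30 days behind; blue: small English-word difference) is the sort of note the README could carry.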
english_words_diff_colored = colored_text(mismatched_english_words, 94) if mismatched_english_words < 10 else mismatched_english_words

if abs(mismatched_special) < 10:
    stats["False alerts (suspected)"] += 1
It isn't very meaningful if we only mark it in the statistics. How can we find these False alerts (suspected) among so many files?
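One possible way to address this, sketched under the assumption that stats is a plain dictionary of counters as the diff suggests: keep the matching paths in a list alongside the counter, so they can be printed after the summary. The names record_result and suspected_false_alerts are hypothetical and not part of the PR.

```python
from collections import defaultdict

stats = defaultdict(int)
suspected_false_alerts = []  # hypothetical: the paths behind the counter


def record_result(file, mismatched_special, threshold=10):
    """Count a suspected false alert and remember which file triggered it."""
    if abs(mismatched_special) < threshold:
        stats["False alerts (suspected)"] += 1
        suspected_false_alerts.append(file)


# Illustrative values only, not real output of the tool.
record_result("docs/a.md", 3)
record_result("docs/b.md", -42)
record_result("docs/c.md", -9)

print(stats["False alerts (suspected)"])
for path in suspected_false_alerts:
    print(path)
```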
    return mismatched_special, mismatched_english_words

def compare_directories(path, target_langs):
    dir1 = os.path.join(path, "en")
Names like dir1 and dir2 can make the code difficult for later contributors to read. 😃 Could they be renamed to represent their meaning more accurately?
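For illustration, one way the renaming could look, assuming dir1 is the English source directory (as the diff shows) and dir2 is the directory of the language being checked. The helper language_dirs and the names source_dir and target_dirs are suggestions only, not part of the PR.

```python
import os


def language_dirs(path, target_langs, source_lang="en"):
    """Return (source_dir, {lang: target_dir}); all names here are illustrative."""
    source_dir = os.path.join(path, source_lang)
    target_dirs = {lang: os.path.join(path, lang) for lang in target_langs}
    return source_dir, target_dirs


source_dir, target_dirs = language_dirs("content", ["ko", "zh-cn"])
print(source_dir)
print(target_dirs["zh-cn"])
```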
I'd like to second this; the review bot has asked me to review this PR. While I like the idea of this tool, I'm not on any localization team so my opinion of the tool isn't so useful.
/uncc @nate-double-u
This PR adds a tool (report-outdated-by-mod.py) that reports outdated l10n documents by lastmod difference.
This script compares markdown files across different language directories to identify and report localized documents that may be outdated, based on modification date differences.
It focuses primarily on:
(The similarity analysis includes line counts, special character patterns, and English word usage patterns.)
The table-style output will be useful for maintaining localized documents and for checking the overall status of all languages.
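The core idea of the date comparison can be sketched as follows. This is only an illustration under stated assumptions, not the script's actual code: the last-modified dates are assumed to be already collected into dictionaries keyed by relative path, and days_behind and report_outdated are hypothetical names.

```python
from datetime import datetime, timezone


def days_behind(source_lastmod, localized_lastmod):
    """Days by which the localized file lags the source (0 if up to date)."""
    delta = source_lastmod - localized_lastmod
    return max(delta.days, 0)


def report_outdated(source_dates, localized_dates):
    """Yield (relative_path, days) for localized files older than their source."""
    for rel_path, localized in sorted(localized_dates.items()):
        source = source_dates.get(rel_path)
        if source is None:
            continue  # no source counterpart to compare against
        days = days_behind(source, localized)
        if days > 0:
            yield rel_path, days


# Illustrative dates only, not taken from the repository.
en = {"docs/a.md": datetime(2024, 5, 1, tzinfo=timezone.utc),
      "docs/b.md": datetime(2024, 1, 1, tzinfo=timezone.utc)}
ko = {"docs/a.md": datetime(2024, 3, 1, tzinfo=timezone.utc),
      "docs/b.md": datetime(2024, 2, 1, tzinfo=timezone.utc)}

outdated = list(report_outdated(en, ko))
print(outdated)
```

Here only docs/a.md is reported, because its localized copy predates the English source; docs/b.md was localized after the last English change.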
How to use
Screenshots