Add tool reporting outdated l10n documents by lastmod difference #45844

Open
wants to merge 2 commits into
base: main

seokho-son
Member

This PR adds a tool (report-outdated-by-mod.py) that reports outdated l10n documents by lastmod difference.

This script compares markdown files across different language directories
to identify and report localized documents that may be outdated,
based on modification date differences.

It focuses primarily on:

  • Reporting outdated documents based on modification date differences.
  • Estimating false alerts.
  • Calculating the similarity between the English version and localized versions of documents.
    (similarity analysis includes line counts, special character patterns, and English word usage patterns.)

The table-style output will be useful for maintaining localized documents and for checking the overall status of all languages.
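A minimal sketch of how such a similarity check might be computed follows. It is not the code from this PR: the function names, the set of "special" characters, and the word pattern are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumption: "special characters" means common Markdown/markup punctuation.
SPECIAL_CHARS = re.compile(r"[#*`\[\](){}<>|]")
# Assumption: an "English word" is a run of two or more ASCII letters.
ENGLISH_WORD = re.compile(r"[A-Za-z]{2,}")


def count_mismatch(pattern, base_text, other_text):
    """Sum the absolute per-token count differences between two texts."""
    base = Counter(pattern.findall(base_text))
    other = Counter(pattern.findall(other_text))
    return sum(abs(base[key] - other[key]) for key in base.keys() | other.keys())


def compare_texts(en_text, l10n_text):
    """Return rough similarity indicators for an English/localized pair."""
    return {
        # Negative when the localized text has fewer lines than the English one.
        "line_diff": len(l10n_text.splitlines()) - len(en_text.splitlines()),
        "special_mismatch": count_mismatch(SPECIAL_CHARS, en_text, l10n_text),
        "english_word_mismatch": count_mismatch(ENGLISH_WORD, en_text, l10n_text),
    }


if __name__ == "__main__":
    en = "# Title\nUse `kubectl get pods`.\n"
    ko = "# 제목\n`kubectl get pods` 를 사용합니다.\n"
    print(compare_texts(en, ko))
```

Small mismatch values suggest the localized page still has the same structure as the English page, which is the idea behind treating such pages as possible false alerts.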

How to use

$ python ./scripts/report-outdated-by-mod.py --help
Usage: report-outdated-by-mod.py [-h] [--path PATH] [target_lang ...]

    Users can specify target languages for comparison against the English base.
    If no languages are specified, all directories will be compared.

    The path to the content directory can be specified using the --path parameter; 
    if not provided, './content' or '../content' is used as the default.

positional arguments:
  target_lang  Target language directories (e.g., ko ja fr). If empty, all directories will be compared.

options:
  -h, --help   show this help message and exit
  --path PATH  Base content directory. Default is './content'
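For readers who want to see how a command line like this can be declared, here is a hedged argparse sketch. The argument names are taken from the help text above; the helper names and the fallback order for the content directory are assumptions, not the script's actual code.

```python
import argparse
import os


def build_parser():
    """Declare the interface shown in the help output above."""
    parser = argparse.ArgumentParser(prog="report-outdated-by-mod.py")
    parser.add_argument(
        "target_lang",
        nargs="*",  # zero or more languages; empty means "compare all"
        help="Target language directories (e.g., ko ja fr).",
    )
    parser.add_argument("--path", default=None, help="Base content directory.")
    return parser


def resolve_content_dir(path=None, candidates=("./content", "../content")):
    """Return the explicit path, else the first existing candidate (assumed order)."""
    if path:
        return path
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    return None


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.target_lang, resolve_content_dir(args.path))
```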

Screenshots

  • ./scripts/report-outdated-by-mod.py ko

image

image

  • ./scripts/report-outdated-by-mod.py

image

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign natalisucks for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the sig/docs label (Categorizes an issue or PR as relevant to SIG Docs.) Apr 11, 2024
@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels Apr 11, 2024

netlify bot commented Apr 11, 2024

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit d60e15b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/661d67a93e5d0200087a1b29
😎 Deploy Preview https://deploy-preview-45844--kubernetes-io-main-staging.netlify.app

@seokho-son
Member Author

/area localization

@k8s-ci-robot added the area/localization label (General issues or PRs related to localization) Apr 11, 2024

for file in files:
    if file.endswith(file_extension):
        full_path = os.path.join(root, file)
        last_modified_time = os.path.getmtime(full_path)
Contributor

If we can look up the last modification time from Git, especially for a source file, I think that helps even more.

Member

I found this Stack Overflow thread:
'How to get the last modification date of a file in a git repo?'

The correct way to do this is to use git log as follows.

git log -1 --pretty="format:%ci" /path/to/repo/anyfile.any

-1 restricts it to the very last time the file changed

%ci is just one of the date formats you can choose from others here at https://git-scm.com/docs/pretty-formats

To do this in Python code, we might

Member Author

Thanks! I think getting the last modification time from Git is a better way.
However, I hope to check

Member Author

@sftim @jihoon-seo
It seems https://gohugo.io/methods/page/lastmod/ utilizes Git :)

Set the last modification date to the Author Date of the last Git commit for that file. See GitInfo for details.

I will update the script to utilize Git to get lastmod. Thanks!
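A minimal sketch of what a Git-based lookup might look like in Python is shown below. It is an illustration, not the code that was added to the PR: the function names are hypothetical, and it uses the `%at` placeholder (author date as a UNIX timestamp) because the Hugo documentation quoted above refers to the Author Date, whereas the `%ci` format mentioned earlier in this thread prints the committer date.

```python
import subprocess
from datetime import datetime, timezone


def parse_git_timestamp(raw):
    """Convert `git log --pretty=format:%at` output to an aware datetime.

    Returns None when Git printed nothing (for example, an untracked file).
    """
    raw = raw.strip()
    if not raw:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def git_lastmod(path, repo_dir="."):
    """Return the author date of the last commit touching `path`.

    Assumes `git` is on PATH and `repo_dir` is inside a Git work tree.
    """
    result = subprocess.run(
        ["git", "log", "-1", "--pretty=format:%at", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_git_timestamp(result.stdout)


if __name__ == "__main__":
    print(git_lastmod("README.md"))
```

Unlike os.path.getmtime, this value does not depend on when the file was checked out, so a fresh clone and a long-lived working copy should report the same date.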

Member Author

Hi @sftim
I've applied your suggestion. Thanks!

https://github.com/kubernetes/website/pull/45844/files#diff-00ab75ef0cdb9774cf44b3557729c4f87dceea713c863cfa1090e55f834dec5aR42

The outcome of lastmod is the same as the previous method (using os.path.getmtime). So, I think we don't have to change the other sections regarding this matter.

based on modification date differences.

It focuses primarily on:
1. Reporting outdated documents based on modification date differences.
Contributor

It would be great if these statistics could also be published somewhere on the web periodically so that people could consult the progress status and be motivated to catch up. Maybe not the scope of this PR but an idea ;)

Member Author

Thank you, @ricardoamaro.
Yes, if there is a general agreement among the localization teams, we could potentially automate this program to run periodically (using GitHub workflows, etc.).
Once this PR is merged, I plan to introduce this tool to the localization contributors and test its usefulness.

Signed-off-by: Seokho Son <shsongist@gmail.com>
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Jul 14, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) labels Aug 13, 2024
@seokho-son
Member Author

Hi @divya-mohan0209 @reylejano @natalisucks
I think this PR is ready for approval. I believe this script is useful for localization teams as is, and the tool can be further enhanced if needed.

@tengqm
Contributor

tengqm commented Sep 11, 2024

Since we have a much simpler script for this, i.e. scripts/lsync.sh, why bother adding a new tool which does almost the same thing?

@sftim
Contributor

sftim commented Sep 11, 2024

If the new tool provides a benefit to a localization team, I think it's welcome, because we support localization teams in picking a workflow that works for them.

It's also OK to combine the lsync.sh and report-outdated-by-mod.py tools; that would need buy-in from all the localization teams that rely on either tool.

@seokho-son
Member Author

Hi @tengqm @sftim

I understand that lsync.sh is a simple tool that is already being used effectively by specific localization teams to track differences between documents. However, I believe the script tool introduced in this PR has a somewhat different purpose, as described in the PR content.

  • Reporting outdated documents based on modification date differences.
    Estimating false alerts. The output in table format will be useful for maintaining localized documents and checking the overall status of all languages.
  • Calculating the similarity between the English version and localized versions of documents. (The similarity analysis includes line counts, special character patterns, and English word usage patterns.)

Although it is possible to merge it with an existing script like lsync.sh, I think merging might not bring significant benefits to contributors who are already using the simple lsync.sh effectively for their purposes. In fact, it could introduce unnecessary inconvenience. Instead, I suggest treating the script introduced in this PR as a Proof of Concept and encouraging people to try it out and improve if necessary.

@sftim
Contributor

sftim commented Oct 2, 2024

I like the idea of this, but I'm not in any localization team.

Also see #48163

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) Oct 2, 2024
@sftim
Contributor

sftim commented Oct 2, 2024

@seokho-son I've not LGTMed or approved this because:

  • I don't do localization work enough to check whether this script is useful
  • (AIUI) I shouldn't have access to approve this change

I recommend asking localization teams to try it out and comment.

Member

@mengjiao-liu left a comment

As a member of the Chinese localization team, I tested this script, but I found that its output is not consistent with that of the existing script ./scripts/lsync.sh (= git log -1 --pretty="format:%ci" /path/to/repo/anyfile.any).

Take the file ./content/zh-cn/docs/tasks/debug/debug-cluster/audit.md as an example (based on the latest main branch):

This is the result of running script ./scripts/report-outdated-by-mod.py (Line(D):-24, Words(D):36):

python ./scripts/report-outdated-by-mod.py  zh-cn

image

This is the result of running script ./scripts/lsync.sh (3 insertions(+), 3 deletions(-)):
image

Perhaps the difference in line numbers is due to different calculation methods. But this confuses me. Could you explain this difference?
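One possible reading, offered here as an assumption rather than a description of either script, is that the two numbers measure different things: a figure such as Line(D) may be the difference in total line counts between two files, while "3 insertions(+), 3 deletions(-)" counts changed lines in a diff. The sketch below uses hypothetical helper names and Python's difflib to show that these two metrics need not agree.

```python
import difflib


def line_count_diff(base_lines, other_lines):
    """Difference in total line counts (negative if `other` is shorter)."""
    return len(other_lines) - len(base_lines)


def insertions_deletions(old_lines, new_lines):
    """Count added and removed lines, similar in spirit to a diff stat."""
    insertions = deletions = 0
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            insertions += 1
        elif line.startswith("- "):
            deletions += 1
    return insertions, deletions


if __name__ == "__main__":
    old = ["a", "b", "c"]
    new = ["a", "x", "c"]
    # One line was rewritten: the diff reports a change, the line count does not.
    print(insertions_deletions(old, new), line_count_diff(old, new))
```

A file with three rewritten lines shows 3 insertions and 3 deletions but a net line-count change of zero, so a nonzero line-count difference between an English page and its translation would have to come from something else, such as the translation keeping the original text in comments or wrapping lines differently.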

full_path = os.path.join(path, target_lang, file)
mismatched_special, mismatched_english_words = calculate_similarity(os.path.join(dir1, file), os.path.join(dir2, file))

last_mod_colored = colored_text(f"{days} days", 91) if days > 30 else f"{days} days"
Member

Could the meaning of the different warning colors be documented in README.md, so that future contributors and users can understand the output more quickly?
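As a starting point for such documentation, the sketch below shows one way the colors could be produced and explained. The codes 91 and 94 and the thresholds come from the snippets under review; the body of colored_text, the legend text, and the helper names format_days and print_legend are assumptions for illustration.

```python
def colored_text(text, color_code):
    """Wrap text in an ANSI SGR color sequence, then reset the color."""
    return f"\033[{color_code}m{text}\033[0m"


# Hypothetical legend; 91 is ANSI bright red and 94 is ANSI bright blue.
COLOR_LEGEND = {
    91: "bright red: last modification gap is more than 30 days",
    94: "bright blue: fewer than 10 mismatched English words",
}


def format_days(days):
    """Highlight gaps above 30 days, mirroring the reviewed condition."""
    return colored_text(f"{days} days", 91) if days > 30 else f"{days} days"


def print_legend():
    """Print each color code in its own color next to its meaning."""
    for code, meaning in COLOR_LEGEND.items():
        print(colored_text(f"[{code}]", code), meaning)


if __name__ == "__main__":
    print_legend()
    print(format_days(45), format_days(7))
```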

english_words_diff_colored = colored_text(mismatched_english_words, 94) if mismatched_english_words < 10 else mismatched_english_words

if abs(mismatched_special) < 10:
    stats["False alerts (suspected)"] += 1
Member

It isn't very meaningful if we only mark it in the statistics. How can we find these False alerts (suspected) among so many files?
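One way to address this, sketched under the assumption that each compared file yields a (path, mismatched_special) pair, is to collect the suspected files in a list rather than only incrementing a counter. The function name and the input shape are hypothetical; the threshold of 10 is taken from the snippet above.

```python
FALSE_ALERT_THRESHOLD = 10  # same limit as in the reviewed condition


def collect_suspected_false_alerts(rows, threshold=FALSE_ALERT_THRESHOLD):
    """Return the paths whose special-character mismatch is below the threshold.

    `rows` is an iterable of (file_path, mismatched_special) pairs.
    """
    return [
        file_path
        for file_path, mismatched_special in rows
        if abs(mismatched_special) < threshold
    ]


if __name__ == "__main__":
    rows = [("docs/a.md", 3), ("docs/b.md", -42), ("docs/c.md", -9)]
    for file_path in collect_suspected_false_alerts(rows):
        print("False alert (suspected):", file_path)
```

The statistics line can then be derived from the length of the list, so the count and the per-file report stay consistent.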

    return mismatched_special, mismatched_english_words

def compare_directories(path, target_langs):
    dir1 = os.path.join(path, "en")
Member

Names like dir1 and dir2 can make the code difficult for later contributors to read. 😃 Could they be renamed to describe what they represent?

@nate-double-u
Contributor

> @seokho-son I've not LGTMed or approved this because:
>
>   • I don't do localization work enough to check whether this script is useful
>   • (AIUI) I shouldn't have access to approve this change
>
> I recommend asking localization teams to try it out and comment.

I'd like to second this—the review bot has asked me to review this PR. While I like the idea of this tool, I'm not on any localization team so my opinion of the tool isn't so useful.

/uncc @nate-double-u

@k8s-ci-robot removed the request for review from nate-double-u October 23, 2024 23:51
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Jan 22, 2025
Labels

  • area/localization: General issues or PRs related to localization
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • sig/docs: Categorizes an issue or PR as relevant to SIG Docs.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.


9 participants