
fix(env-check): remove monitor initial pending to fix disk mismatch problem #3426

Draft · wants to merge 1 commit into master

Conversation

@COLDTURNIP (Contributor) commented Dec 30, 2024

EDIT: changed to draft, since the initial waiting is expected behavior when a disk mismatch really happens.

Which issue(s) this PR fixes:

Issue longhorn/longhorn#10098

What this PR does / why we need it:

Trigger the initial disk status collection right after the disk monitor is created, to eliminate disk mismatch errors, which block the rest of the node sync logic.
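A minimal sketch of the approach, with illustrative names (the actual change lives in controller/monitor/disk_monitor.go):

```go
// Sketch only: run one synchronous collection before the polling goroutine
// starts, so the first node sync sees real disk data instead of an empty
// pending state. Field and method names here are assumptions.
package monitor

import (
	"context"
	"sync"
	"time"

	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/util/wait"
)

type diskMonitorSketch struct {
	ctx        context.Context
	startOnce  sync.Once
	syncPeriod time.Duration
	run        func() error // performs one disk data collection cycle
}

func (m *diskMonitorSketch) Start() {
	m.startOnce.Do(func() {
		// Initial, synchronous collection: the fix this PR proposes.
		if err := m.run(); err != nil {
			logrus.WithError(err).Warn("Initial disk data collection failed")
		}
		go func() {
			_ = wait.PollUntilContextCancel(m.ctx, m.syncPeriod, false,
				func(context.Context) (bool, error) {
					if err := m.run(); err != nil {
						logrus.WithError(err).Error("Disk monitor iteration failed")
					}
					return false, nil // never done; stop via context cancellation
				})
		}()
	})
}
```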

Special notes for your reviewer:

Additional documentation or context

fix(env-check): remove monitor initial pending to fix disk mismatch problem

longhorn/longhorn-10098

Signed-off-by: Raphanus Lo <yunchang.lo@suse.com>
@COLDTURNIP COLDTURNIP requested a review from derekbit December 30, 2024 09:32

coderabbitai bot commented Dec 30, 2024

Walkthrough

The pull request introduces significant structural changes to the monitoring components in the Longhorn project, transitioning from concrete monitor implementations to a more generic, type-safe approach. The changes primarily focus on the Monitor interface and its implementations in various monitor classes, such as DiskMonitor, EnvironmentCheckMonitor, and SnapshotMonitor. These modifications enhance type specificity, encapsulation, and error handling across the monitoring system.

Changes

| File | Change Summary |
| --- | --- |
| controller/monitor/monitor.go | Updated `Monitor` interface to support generics with a type parameter `CollectedDataType` |
| controller/monitor/disk_monitor.go | Refactored `DiskMonitor` to use the generic `Monitor[map[string]*CollectedDiskInfo]` type |
| controller/monitor/environment_check_monitor.go | Transformed `EnvironmentCheckMonitor` to use the generic `Monitor[[]longhorn.Condition]` type |
| controller/monitor/snapshot_monitor.go | Restructured `SnapshotMonitor` to use the generic `Monitor[SnapshotMonitorStatus]` type |
| controller/node_controller.go | Updated monitor types and method signatures to use more specific monitor types |

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Resolve disk status sync issues during initial minutes (#10098) | ❓ | Changes do not directly address the root cause of disk mismatching during initial startup |
| Improve error handling in monitor initialization | ✅ | Enhanced type safety and error handling in monitor implementations |
| Reduce potential runtime errors | ✅ | Introduced more specific type handling and error management |

Sequence Diagram

```mermaid
sequenceDiagram
    participant NC as NodeController
    participant DM as DiskMonitor
    participant DS as DataStore

    NC->>DM: createDiskMonitor()
    DM-->>NC: Initialize Monitor
    NC->>DM: Start()
    DM->>DS: Collect Initial Disk Data
    DM-->>NC: Notify Data Collection
```

The sequence diagram illustrates the high-level interaction between the NodeController and the DiskMonitor, showing the initialization and initial data collection process.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
controller/monitor/disk_monitor.go (1)

405-405: Handling orphaned or unknown disk data

getOrphanedReplicaDataStores delegates block vs. filesystem disk handling cleanly. Keep watch for potential edge conditions if new disk types are introduced.

controller/monitor/fake_disk_monitor.go (1)

24-24: Consider documenting the return type change.
Changing the function to return DiskMonitor instead of *DiskMonitor can improve abstraction. However, external callers may find it more intuitive if this change is highlighted in a release note or function docstrings.

controller/monitor/snapshot_monitor.go (4)

Line range hint 158-180: Periodic check for snapshots is straightforward.
This approach ensures scheduled checks. If you need more granular or immediate checks triggered by external signals, consider hooking them into this code or the event queue.


258-262: RunOnce unimplemented.
If there is a test scenario or CLI usage that expects a one-off snapshot check, implementing RunOnce might be beneficial. Otherwise, consider documenting this method’s state as intentionally unimplemented.
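If it stays unimplemented, the latter suggestion is cheap to apply (a sketch):

```go
// RunOnce is intentionally unimplemented: snapshot checks are driven by
// the event queue and the periodic scheduler instead of one-off runs.
func (m *snapshotMonitorImpl) RunOnce() error {
	return errors.New("RunOnce is not implemented for the snapshot monitor")
}
```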


317-324: inProgressSnapshotCheckTasks concurrency.
The lock usage is correct. Consider whether a sync.Map or a concurrent set might simplify usage, but your current approach is acceptable.


Line range hint 458-501: Syncing checksum and evicting corrupted replicas.
The logic is clear and thorough. You might add more logs if you suspect certain replica addresses are frequently corrupted, but otherwise this is well-executed.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1875421 and 9d63fc8.

📒 Files selected for processing (7)
  • controller/monitor/disk_monitor.go (11 hunks)
  • controller/monitor/environment_check_monitor.go (16 hunks)
  • controller/monitor/fake_disk_monitor.go (1 hunks)
  • controller/monitor/fake_environment_check_monitor.go (4 hunks)
  • controller/monitor/monitor.go (1 hunks)
  • controller/monitor/snapshot_monitor.go (20 hunks)
  • controller/node_controller.go (7 hunks)
🧰 Additional context used
🪛 GitHub Check: CodeFactor
controller/monitor/snapshot_monitor.go

[notice] 94-94: controller/monitor/snapshot_monitor.go#L94
Exported func NewSnapshotMonitor returns unexported type *monitor.snapshotMonitorImpl, which can be annoying to use. (unexported-return)

🔇 Additional comments (80)
controller/monitor/monitor.go (2)

12-12: Use of generics enhances type safety

Introducing a type parameter for the Monitor interface significantly improves type safety and clarity in monitor implementations.


16-16: Typed return type adds clarity

Changing GetCollectedData to return CollectedDataType instead of interface{} removes ambiguity and ensures that callers can rely on a strongly typed result.
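Based on these comments, the interface plausibly looks like the following sketch (the exact method set is inferred, not confirmed):

```go
// Sketch of the generics-based Monitor interface; the method set is
// inferred from the review comments, not copied from the source.
type Monitor[CollectedDataType any] interface {
	Start()
	Stop()
	RunOnce() error
	UpdateConfiguration(map[string]interface{}) error
	GetCollectedData() (CollectedDataType, error)
}
```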

controller/monitor/disk_monitor.go (18)

38-39: Align the type definition with the interface usage

Defining type DiskMonitor Monitor[map[string]*CollectedDiskInfo] clarifies the collected data structure. It consistently matches the generics-based Monitor interface.


40-41: Implementation check

Declaring var _ DiskMonitor = &diskMonitorImpl{} ensures diskMonitorImpl correctly implements the DiskMonitor interface. This is a good pattern to catch compile-time interface mismatches.
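Put together, the two lines discussed above compose like this:

```go
// The instantiated interface type for disk monitoring...
type DiskMonitor Monitor[map[string]*CollectedDiskInfo]

// ...and the compile-time proof that the unexported implementation
// satisfies it. A build failure here surfaces interface drift early.
var _ DiskMonitor = &diskMonitorImpl{}
```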


42-42: Consistent naming for implementation struct

Renaming the struct to diskMonitorImpl clarifies that it's the concrete implementation of the DiskMonitor interface. This is a clean approach.


48-48: Lock usage observation

The combination of sync.Once and sync.RWMutex looks appropriate for safely initializing and accessing collectedData. Remember to avoid potential deadlocks if other methods acquire both locks in a reversed order.


77-77: Constructor’s explicit return type

The updated return type (DiskMonitor, error) is consistent with the generics-based interface. This is beneficial for type-checking and maintaining a clean API.


80-80: Initialization correctness

All handlers, locks, and collectedData fields are properly instantiated. The readiness for immediate usage of diskMonitorImpl is clear.


100-119: Immediate run before goroutine

Running m.run() once before starting the polling goroutine ensures the initial disk state is collected promptly, aligning well with the PR's stated objectives to fix mismatch issues at startup.


122-122: Stop method clarity

Calling m.quit() cleanly cancels the context, ensuring the goroutine can terminate gracefully.


126-127: Synchronous run usage

RunOnce() calling m.run() is a straightforward design for single-iteration data collection. This helps in debugging or on-demand fetching.


130-130: Optional configuration extension

UpdateConfiguration(map[string]interface{}) error { return nil } is a no-op. If future configuration changes are needed, this design is easily extended.


134-134: Deep copying collected data

Copying m.collectedData into a new map ensures concurrent readers won’t inadvertently mutate this internal state. The approach is well suited for safe external consumption.
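A sketch of the copy-on-read pattern described here (field names assumed from the surrounding comments):

```go
// Sketch: hand out a fresh map so callers cannot mutate internal state.
func (m *diskMonitorImpl) GetCollectedData() (map[string]*CollectedDiskInfo, error) {
	m.collectedDataLock.RLock()
	defer m.collectedDataLock.RUnlock()

	data := make(map[string]*CollectedDiskInfo, len(m.collectedData))
	for disk, info := range m.collectedData {
		data[disk] = info // values stay shared pointers; deep-copy if callers mutate them
	}
	return data, nil
}
```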


146-146: Main monitoring loop logic

run() properly retrieves node info, collects disk data, and triggers a sync callback if changes are detected. This pattern is consistent with the normal monitor workflow.


167-167: Robust instance manager retrieval logic

getRunningInstanceManagerRO handles multiple data engines and searches for a running manager if the default one isn’t found. This fallback approach is practical for ensuring function continuity.


193-193: Lazily creating disk service clients

newDiskServiceClients attempts to instantiate clients for each data engine. The fallback logic ensures no immediate crash if a key client is unavailable.


220-220: Defensive closure of clients

closeDiskServiceClients carefully closes all active clients to prevent resource leaks and leftover open connections.


230-230: Central place for disk data collection

collectDiskData centralizes the logic of retrieving disk configs, stats, and orphaned replicas. This function’s approach is cohesive and is a good single-responsibility method.


416-416: Skipping found replicas in the data store

getOrphanedReplicaLvolNames discards recognized replicas from the orphan set. This is consistent with the approach for cleaning up unknown volumes.


431-431: Filesystem orphan detection logic

getOrphanedReplicaDirectoryNames uses both node status checks and the volume metadata file presence. This multi-step validation is beneficial for avoiding false positives.

controller/monitor/environment_check_monitor.go (24)

48-49: Generics-based definition

type EnvironmentCheckMonitor Monitor[[]longhorn.Condition] provides consistent usage of generics, mirroring DiskMonitor.


50-51: Correct interface check

var _ EnvironmentCheckMonitor = &environmentCheckMonitorImpl{} is a good mechanism to validate the implementation at compile time.


52-52: Implementation naming

environmentCheckMonitorImpl follows the same naming convention as diskMonitorImpl for clarity and consistency across the codebase.


57-57: Concurrent data access

Use of sync.RWMutex for collectedData is a safe approach to concurrent reads and occasional writes.


68-68: Constructor returning interface

NewEnvironmentCheckMonitor(...) (EnvironmentCheckMonitor, error) aligns with the new generics-based Monitor interface. This fosters type clarity throughout the code.


71-71: Comprehensive initialization

All fields (collectedDataLock, collectedData, syncCallback) are properly initialized. This ensures readiness for immediate monitoring.


85-104: Immediate run logic

The Start method pre-runs the environment check before scheduling periodic checks, ensuring immediate data availability. The poll logic is consistent with DiskMonitor.


107-107: Graceful shutdown

The Stop() method properly invokes m.quit(). This pattern is consistent across monitors for concurrency termination.


111-112: On-demand environment check

RunOnce is aligned with the overall design, providing a quick single-run check method when needed.


115-115: Optional flexible config

Similar to DiskMonitor, UpdateConfiguration is left unimplemented. This design is flexible if future runtime adjustments become necessary.


119-119: Safe copying of condition data

copier.CopyWithOption usage ensures that callers receive a distinct copy of the environment check conditions without risking concurrency issues.
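The pattern presumably resembles this sketch using github.com/jinzhu/copier (the exact options are an assumption):

```go
// Sketch: return a deep copy of the collected conditions; the copier
// options shown here are assumed, not copied from the source.
func (m *environmentCheckMonitorImpl) GetCollectedData() ([]longhorn.Condition, error) {
	m.collectedDataLock.RLock()
	defer m.collectedDataLock.RUnlock()

	conditions := []longhorn.Condition{}
	if err := copier.CopyWithOption(&conditions, &m.collectedData,
		copier.Option{IgnoreEmpty: true, DeepCopy: true}); err != nil {
		return nil, errors.Wrap(err, "failed to copy collected environment conditions")
	}
	return conditions, nil
}
```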


131-131: Sync callback on changes

Within run(), the new conditions are compared to the existing. Upon differences, a syncCallback is triggered. This is crucial for reactive environment-based logic.


152-152: Fallback for missing kube node

When GetKubernetesNodeRO fails, it returns an empty set of conditions, which ensures the rest of the logic can proceed without a crash.


163-163: Condition-based environment check aggregator

environmentCheck collects multiple checks (packages, multipathd, NFS versions, kernel modules, huge pages) into a single set of conditions. This design is modular and extendable.


189-189: Package installation checks per distribution

syncPackagesInstalled uses OS detection to decide which packages to verify. This approach is thorough, utilizing distribution-based branching.


257-257: Talos edge case handling

syncPackagesInstalledTalosLinux accommodates the specialized environment of Talos. These checks ensure package existence despite the unique OS structure.


333-333: Detecting multipathd conflict

syncMultipathd warns about known issues with multipathd. This is an example of hazard-based environment checks that can prevent data corruption or performance pitfalls.


352-352: Batching package checks

checkPackageInstalled abstracts the logic of validating packages in multiple namespaces. Combining them prevents repetitive code paths.


370-370: Huge page capacity check

Ensuring the node’s hugepage capacity meets the configured threshold is essential for V2 data engine usage. This logic is properly integrated with node conditions.


404-404: Kernel module scanning approach

checkKernelModulesLoaded tries kmod first, with fallback to config checks if not found. This layered approach is robust for multi-environment setups.


465-465: Validating modules in kernel config files

checkModulesLoadedByConfigFile elegantly handles both /boot and /proc fallback scenarios. This logic is thorough for containerized environments.


487-487: NFS mount config checks

checkNFSMountConfigFile ensures the default NFS protocol version is recognized and supported. This step is critical to avoid unexpected NFS issues.


503-503: Module state verification

checkKernelModuleEnabled inspects kernel configs for y or m states, then uses kmod checks for dynamically loaded modules. This is a robust multi-step approach.


539-539: Confirm NFS client availability

synchronizeNFSClientVersion checks whether the node’s kernel modules and default mount config allow NFS versions that Longhorn supports, making environment readiness more predictable.

controller/node_controller.go (18)

59-59: Switch to specialized monitor types

diskMonitor monitor.DiskMonitor enforces type specificity, removing the need for additional type assertions when retrieving disk data.


60-60: Enhanced environment checks

Using environmentCheckMonitor monitor.EnvironmentCheckMonitor clarifies usage, eliminating generic type confusion.


62-62: Snapshot monitor specialized field

A distinct monitor.SnapshotMonitor field for snapshot operations ensures each monitor type is semantically correct.


477-479: Graceful handling of snapshot monitor errors

A check is performed to handle failures in GetCollectedData() gracefully, preventing a crash loop if the snapshot monitor is temporarily unavailable.
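A sketch of the tolerate-and-retry pattern (method and field names assumed):

```go
// Sketch: a temporarily unavailable snapshot monitor should not fail the
// whole node sync; names here are illustrative.
func (nc *NodeController) syncSnapshotStatus(node *longhorn.Node) error {
	data, err := nc.snapshotMonitor.GetCollectedData()
	if err != nil {
		nc.logger.WithError(err).Warn("Failed to get collected data from the snapshot monitor")
		return nil // pick it up on the next reconcile instead of crash-looping
	}
	_ = data // apply the collected snapshot status to node.Status here
	return nil
}
```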


1276-1276: Lazy creation of DiskMonitor

createDiskMonitor() returns the existing monitor if present, preventing redundant re-initialization. This pattern is effective in ensuring a single consistent monitor instance.
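A sketch of the lazy-creation pattern (constructor arguments are illustrative; see controller/node_controller.go for the real signature):

```go
// Sketch: reuse the existing monitor if one is already running; only
// construct and start a new one on first use.
func (nc *NodeController) createDiskMonitor() (monitor.DiskMonitor, error) {
	if nc.diskMonitor != nil {
		return nc.diskMonitor, nil
	}

	mon, err := monitor.NewDiskMonitor(nc.logger, nc.ds, nc.controllerID, nc.enqueueNodeForMonitor)
	if err != nil {
		return nil, err
	}
	mon.Start() // triggers the initial collection right away

	nc.diskMonitor = mon
	return mon, nil
}
```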


1281-1281: Nesting the run operation

Invoking Start() after constructing the monitor ensures the initial data collection is triggered right away.


1285-1285: Context-based concurrency

The Start() method uses the same approach (PollUntilContextCancel) as in other monitors, ensuring consistent concurrency patterns across the codebase.


1287-1287: Retain reference

Assigning the newly created monitor to nc.diskMonitor ensures that future references use the same monitor instance for disk collection logic.


1289-1289: Returning interface ensures flexibility

Returning the interface type DiskMonitor keeps the calling code decoupled from the implementation details.


1292-1292: EnvironmentCheckMonitor creation pattern

Similar to createDiskMonitor, the method lazily initializes the environment monitor, preventing duplication.


1297-1297: Immediate monitoring start

mon.Start() triggers environment checks without waiting for a next reconcile cycle, ensuring environment mismatches are caught early.


1301-1301: Consistent concurrency initialization

The environment monitor uses the same concurrency design as the disk monitor, keeping concurrency patterns uniform.


1303-1303: Single reference assignment

Storing mon in nc.environmentCheckMonitor ensures stateful reference for environment checks throughout the node controller’s lifecycle.


1305-1305: Return new or existing environment monitor

Returning the EnvironmentCheckMonitor interface encapsulates internal details, consistent with the rest of the design.


1444-1444: Verifying disk mismatch

The `if matched := isDiskMatched(...); !matched` check calls out mismatched disks with a warning rather than silently ignoring them. This direct check is essential for debugging.
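A sketch of that guard (names assumed):

```go
// Sketch: surface the mismatch instead of silently proceeding.
if matched := isDiskMatched(node, collectedDiskInfo); !matched {
	log.Warnf("Disk data collected by the monitor does not match the disks on node %v; retrying later", node.Name)
	return nil
}
```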


1458-1460: Robust environment condition retrieval

Collecting environment conditions from the monitor and tolerating errors ensures partial checks do not block the entire node sync logic.


1580-1580: Snapshot monitor creation pattern

Similar to disk/environment monitors, createSnapshotMonitor returns early if already initialized, preventing extra goroutines.


1595-1595: Consistent concurrency model for snapshot monitor

mon.Start() uses the same approach as the other monitors, ensuring uniform concurrency across all monitor implementations.

controller/monitor/fake_environment_check_monitor.go (5)

29-29: Constructor returning an interface

NewFakeEnvironmentCheckMonitor(...) (EnvironmentCheckMonitor, error) ensures that tests can swap the real environment check monitor with this fake seamlessly.


48-48: Immediate first run enabled

true parameter to PollUntilContextCancel triggers a check immediately on Start(), matching production behavior in a test/fake context.
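For reference, a sketch of the call shape in k8s.io/apimachinery/pkg/util/wait (the surrounding names are assumed):

```go
// Passing true as the third argument makes the condition run once
// immediately instead of waiting one full interval before the first check.
_ = wait.PollUntilContextCancel(ctx, pollInterval, true,
	func(ctx context.Context) (bool, error) {
		if err := m.run(ctx); err != nil {
			logrus.WithError(err).Error("Environment check iteration failed")
		}
		return false, nil // keep polling; stop via context cancellation
	})
```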


66-66: RunOnce pattern reusability

RunOnce aligning with the same usage as the real monitor ensures test code can replicate single-run environment checks.


73-73: Concrete typed return

Replacing (interface{}, error) with ([]longhorn.Condition, error) clarifies the test behavior by returning strongly typed data from the fake.


85-85: Minimal environment check cycle

run() sets an empty slice of conditions, simulating a check result. This is sufficient for testing syncCallback logic without real scanning.

controller/monitor/fake_disk_monitor.go (1)

27-27: Returning a pointer to diskMonitorImpl as DiskMonitor.
Returning an interface while constructing a pointer to the concrete struct is a sound approach for encapsulation. This pattern should be consistently used in the rest of the codebase for clarity.

controller/monitor/snapshot_monitor.go (12)

64-64: Easy-to-read alias for generics.
Defining type SnapshotMonitor Monitor[SnapshotMonitorStatus] can improve readability. Ensure other places—like references or documentation—are updated for consistent usage.


66-66: Enabling compile-time type checks for the interface.
The assignment var _ SnapshotMonitor = &snapshotMonitorImpl{} is a clean way to verify that snapshotMonitorImpl indeed implements the interface.


94-94: Unexported return type flagged by static analysis.
Returning (*snapshotMonitorImpl, error) from a factory function named NewSnapshotMonitor is okay if you are certain no external package needs a named type. If external usage or testing expects a fully exported type, consider returning the interface directly.

🧰 Tools
🪛 GitHub Check: CodeFactor

[notice] 94-94: controller/monitor/snapshot_monitor.go#L94
Exported func NewSnapshotMonitor returns unexported type *monitor.snapshotMonitorImpl, which can be annoying to use. (unexported-return)
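One way to satisfy the linter, as suggested above, is to return the exported interface (a sketch with trimmed parameters):

```go
// Sketch: return the exported SnapshotMonitor interface rather than the
// unexported *snapshotMonitorImpl; parameters are trimmed for illustration.
func NewSnapshotMonitor(logger logrus.FieldLogger) (SnapshotMonitor, error) {
	m := &snapshotMonitorImpl{logger: logger}
	return m, nil
}
```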


125-132: Ensure concurrency safety when starting the monitor.
The usage of startOnce.Do prevents multiple goroutines from calling Start(). Confirm that further concurrent calls won’t reinitialize or break any shared state.
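The guard described here typically takes this shape (the worker name is assumed):

```go
// Sketch: sync.Once makes repeated Start() calls harmless; only the first
// call launches the background workers.
func (m *snapshotMonitorImpl) Start() {
	m.startOnce.Do(func() {
		go m.processSnapshotCheckTasks() // assumed worker name, for illustration
	})
}
```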


Line range hint 135-153: Efficient snapshot event handling.
Adding snapshot tasks to the queue is non-blocking and appears well-coordinated. Make sure large volumes of events don’t overwhelm downstream processing.


Line range hint 198-222: Robust error handling logic.
Handling the snapshot hashing error gracefully and controlling the requeue limit is well-implemented. Confirm that the error strings are stable enough for string matching.


245-250: Clean shutdown sequence.
Calling m.snapshotCheckTaskQueue.ShutDown() followed by stopping the scheduler helps avoid race conditions in concurrent operations.


297-303: Snapshot status retrieval is thread-safe.
Locking before accessing m.SnapshotMonitorStatus is correct. Keep an eye on similar data structures for consistent locking patterns.


Line range hint 363-383: Guard conditions are neatly organized.
Ensuring the volume isn’t purging, rebuilding, restoring, or migrating before triggering snapshot hashing is crucial. This helps prevent inconsistent snapshot states.


Line range hint 403-429: Comprehensive retry logic with retry-go.
The error-based retry mechanism ensures snapshot hashing is retried up to a limit. Good usage of LastErrorOnly to avoid spamming logs.


Line range hint 440-449: Dedicated checks for volume restoration.
Properly gating the hashing process while the volume is restoring prevents corrupted states or conflicting operations.


580-580: Falling back to global data integrity setting.
When volume.Spec.SnapshotDataIntegrity is Ignored, retrieving the global setting ensures consistent behavior. This is a robust fallback mechanism.

@COLDTURNIP COLDTURNIP marked this pull request as draft December 30, 2024 09:55
@COLDTURNIP COLDTURNIP self-assigned this Dec 30, 2024