Add support for scrubbing all chunks on a chunkserver for CRC errors #624
base: master
Conversation
The first few red lights:
1a. Also, remember that when you do add locks, they should be acquired for as little time as possible, because holding a lock for a long time causes a huge performance drop (other threads do nothing until you release it).
It's just that if somebody uses a chunkserver on one system and compiles it on a system with totally different signal numbers, that's not something we should care about. This approach would also unnecessarily reserve 3 signal numbers for one purpose.
To sum up, I think it's better if "scrubbing" is implemented by amending the existing test thread, not by adding another one.
Thank you for your advice. As I wrote in the other thread, I'm not a C programmer. I'll try again to reuse the tester thread; that would be a better solution for sure.
One of the issues I had was how to detect the end of the loop, in other words: how can I know, inside the tester thread, that all chunks have been scrubbed? I'm refactoring the whole procedure, but instead of increasing the chunks per second, I'm using a variable to see if a scrub is running. If it is running, no "rate limit" is applied and all chunks are checked one after the other.
It's hard to define "all chunks", because the number of chunks constantly changes on a working installation. I already have an old commit that makes test_thread more frequent; I'll dig it out and post it here. From there, it's not hard to create a script that changes the chunkserver config, reloads it, and thus turns on "scrubbing". After that, there are several ways to check whether "all" chunks were checked. One of them is to add a counter that resets on reload and can be queried to see how many chunks were already checked. Then, since each chunk is checked once, you can estimate when the scrub has covered all chunks.
I still think the easiest scrub is to null-read all data from a mount at each chunkserver... but that's just me leveraging what's already available. Shrugs.
That's why my first version used a static array populated on scrub start.
This is cool, and should be merged, but it is not a proper solution for scrubbing. Our current test system has about 9,000,000 chunks.
That's what I did here: https://github.com/guestisp/lizardfs/blob/a7b3f02c7b7564da8b6c481bc805903f7425e391/src/chunkserver/hddspacemgr.cc#L2975 but if chunks always change (as they should), the current tester implementation will never finish (which is fine for the tester thread). The scrub should finish. What I'm trying to do is fetch a list of the chunks that exist at a given time (the time the scrub starts), loop through them, and then stop at the end. I can read the first chunk path, then go on until the same chunk is seen again. When we see the first scrubbed chunk for the second time, all chunks have been scrubbed.
Yes, but it's the same for ZFS. By that logic the native ZFS scrub would be nonsense, as you can easily read all data from disk (ZFS verifies checksums on every read).
Ok, still trying to implement this feature by reusing the tester thread. If I understood properly, this is a huge infinite loop. Is this true? Now, what I'm unable to understand is what the following code does: the variables are too "short" and not self-explanatory. Another question: is that thread self-updating when chunks are added or removed? How can I "fix" that? I don't want to create an infinite loop for the scrub. Scrubbing should start, scrub what already exists at the time of start, and then finish.
By sending a SIGUSR2 to a chunkserver process, a scrub loop is started (or aborted).
The scrub will check all chunks for any inconsistencies (CRC errors) and will mark any failed chunks as damaged, forcing LizardFS to fix them.