
Dropbox synced library often loses bibtex/metadata #415

Open
astaroph opened this issue May 9, 2023 · 11 comments
Labels
🐛bug Something isn't working 🧑‍🤝‍🧑help wanted🧑‍🤝‍🧑 Extra attention is needed.

Comments

@astaroph

astaroph commented May 9, 2023

Hello,

I've been a big fan of Qiqqa since the commercial days and have really been enjoying all the new features, especially the return of the sync capability to keep my libraries harmonized between my different computers. However, I've been running into a worsening issue: I'll finally have waded through all of the BibTeX sniffing (and recurrent lockouts from Google) and gotten my library up to date, and then on my next sign-in (usually when I sign into the other device, and especially right after I have just synced), I will lose dozens or more of hard-won BibTeX records, for papers that clearly already had them (and have even been part of an Expedition already; see below).
[Screenshot "Capture": library entries that lost their BibTeX metadata]

This has been occurring at some low level for a couple of years now, across several versions of v83. I have tried reverting to v82 and simply maintaining separate libraries on my desktop and laptop that watch the same Dropbox folder (I have to sniff each paper twice to keep them both up to date, but at least it is stable), but it would be very nice to be able to actually take advantage of the new sync capability.
I know there is a lot going on under the hood and not enough of you all to go around, but any help here would be most appreciated!
I'm happy to drop in my logs (extensive, when I last checked, with various warnings and errors that may be helpful).
Thanks!

@GerHobbelt GerHobbelt added 🐛bug Something isn't working 🧑‍🤝‍🧑help wanted🧑‍🤝‍🧑 Extra attention is needed. labels May 9, 2023
@GerHobbelt GerHobbelt added this to the Our Glorious Future milestone May 9, 2023
@GerHobbelt
Collaborator

Ok, this is a "known issue" that's been bothering me for quite a while already. It's one of the reasons why I've decided to do a rewrite of the software, though that's mostly due to blocking problems in the first ~100 issues listed here at GitHub. Anyway, back to your observed problem:

What I would like you to try is this work-around (the technical cause-and-effect elaboration follows later in this writing):

Preface

  1. yes, it may seem rather nuts to have a LOCAL sync directory as that would mean we store the Qiqqa database + PDFs twice on a local disk. Let it be. Just don't ask and I shall tell ye no lies. 😉
  2. Make a backup before you do this. YMMV. Your most important file is the Qiqqa.library database, as that's where all the metadata (BibTeX, etc.) is stored. Better safe than sorry.

The Action

  1. Set Qiqqa up to SYNC to a LOCAL directory.
  2. SYNC to that directory. This should copy your entire library to the sync directory.1
  3. Then copy those files by hand onto Dropbox (a scripted version of this copy step is sketched below).

The reverse process for the other machine(s):

  1. Set the other Qiqqa instance up to SYNC to (its own machine's) LOCAL directory.
  2. Manually copy the files off Dropbox into a similar LOCAL sync directory.
  3. SYNC that Qiqqa instance: if all goes well, it should see the updates and pull them in from the local SYNC dir.
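
If you'd rather script the copy steps than do them by hand, here's a minimal sketch of the copy + verify idea (paths are hypothetical; adjust to your own machines). Plain sequential file copying is exactly the access pattern Dropbox is comfortable with:

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical paths -- adjust to your own setup.
LOCAL_SYNC = Path(r"C:\QiqqaSync")                   # Qiqqa's LOCAL sync dir
DROPBOX    = Path(r"C:\Users\me\Dropbox\QiqqaSync")  # the Dropbox-managed copy

def file_hash(path: Path) -> str:
    """Read a file sequentially and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src_root: Path, dst_root: Path) -> None:
    """Copy every file with a plain sequential copy, then verify each copy."""
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)                 # sequential whole-file copy
        if file_hash(src) != file_hash(dst):   # verify-after-copy
            raise IOError(f"copy verification failed for {src}")

# Step 3 of "The Action": local sync dir -> Dropbox.
copy_and_verify(LOCAL_SYNC, DROPBOX)
# On the other machine, run the reverse: copy_and_verify(DROPBOX, LOCAL_SYNC)
```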

Technical background info

I'll write that up in the next response message.

Footnotes

  1. Note for very observant folks: the sync dir will have a different database file (.s3db vs. .library) and no Lucene.NET index directory, as that stuff is recreated on every instance.

@GerHobbelt
Collaborator

GerHobbelt commented May 10, 2023

Technical background info

Why does DropBox "nuke" our Qiqqa databases?

Well, that's rather vague and a very tough nut to crack, regrettably.

First there's this: Qiqqa uses the well-known SQLite database library (a .NET port/derivative, in fact, but the argument remains the same), and apparently Dropbox, Google Drive and a few other cloud storage providers have a hard time dealing with SQLite databases while SQLite is reading/writing them, i.e. when these database files (Qiqqa.library in our case, BTW) are updated during the Qiqqa SYNC process, which is basically a large database sync action.

Why does this go wrong?

Frankly, I'm stumped. Theoretically this should not happen as SQLite uses regular random access APIs, so there's nothing fancy happening - or so you might think.

Actual reality contradicts this theory.

There are two things:

  • The SQLite FAQ says it is highly discouraged to read/write databases that are located on network storage.

    • https://www.sqlite.org/useovernet.html -- note that Qiqqa has not yet been verified to conform to "Recommendations" 2 and 3 in there, but independent tests with other SQLite-based software lead me to believe Dropbox & friends are doing something extra nasty anyway, as I've seen "impossible" corruptions during those tests.
    • https://www.sqlite.org/lockingv3.html#how_to_corrupt goes way further down that rabbit hole...
    • https://www.sqlite.org/faq.html#q5 -- FAQ nr. 5 may not seem relevant when you're a single user, but I've found it is a hint: I have learned the hard way that "file locking" on networked storage is generally flaky as Hell, and that's not just the old notorious NFS. I reckon -- without any proof whatsoever1 -- that Dropbox & friends, while mirroring their goodies on your local drive, somehow behave like obnoxious network children.
  • Meanwhile, to the layman it may look like Dropbox isn't a "network drive" per se, but rather a layer over a local file mirror. That's what I thought initially too, but it turns out Dropbox makes those areas of local storage behave ever so slightly differently (a quick integrity check you can run on a synced copy is sketched below).
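
By the way, you can at least detect this sort of mangling after the fact with SQLite's built-in integrity check. A minimal sketch (run it against a copy of the database, never the live file Qiqqa has open):

```python
import sqlite3

# Point this at a *copy* of the synced Qiqqa.library file, never the
# live database that Qiqqa itself has open.
db = sqlite3.connect("Qiqqa.library.copy")
try:
    # PRAGMA integrity_check walks the whole file and reports "ok" when
    # the b-tree structures and pages are internally consistent.
    rows = db.execute("PRAGMA integrity_check;").fetchall()
    if rows == [("ok",)]:
        print("database structure looks intact")
    else:
        for (problem,) in rows:
            print("integrity problem:", problem)
finally:
    db.close()
```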

Tests thus far have led to these findings (which are rather fuzzy, as these tests are irritatingly unwilling to produce consistent results: one day you fare better or worse than another...):

  • Random access does not always succeed as you might expect, it seems. While I've never seen the issue with Word files and the like, I saw rare errors when you really start to push the buggers. Which is what SQLite does, essentially, when you do several/many database record updates: while SQLite guarantees crash-safe database management on a pure local drive, it can produce some "weird errors" when you do exactly the same with a database file that's shared via Dropbox et al. My hunch is that Dropbox fucks up somewhere internally where it tries to detect which chunks (sectors) of the file need to be synced, i.e. sent across the wire, and in which direction those should travel. Treat this as a still-open question.

Meanwhile, Qiqqa isn't helping either, as it has had its own little nastiness built in since day one (see also https://github.com/GerHobbelt/qiqqa-revengin#database-file-format-as-discovered-via-db-inspection): Qiqqa, for some very odd reason (apparently they didn't trust their own software at some point, I guess), adds an MD5 hash checksum to every bibTeX metadata record stored in the database.

Fine, that check should always succeed, am I right? Yes, usually. Unless... you start hacking the database from outside Qiqqa (which I was doing back in the day when Qiqqa was still commercial), or someone does the hacking for you: enter Dropbox & friends' at-times-odd behaviour, and to Qiqqa that looks like "record corruption", as SQLite (temporarily or permanently) produces a b0rked record for a given query. The measure taken in the software is to DELETE the metadata record. Ouch. (A sketch of this checksum mechanism follows below.)
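
To illustrate the mechanism (this is not Qiqqa's actual code; the table and column names below are guesses for illustration, loosely based on the db-inspection notes linked above):

```python
import hashlib
import sqlite3

def looks_corrupt(stored_md5: str, blob) -> bool:
    """Recompute the checksum over the stored metadata blob and compare
    it with the MD5 recorded alongside it."""
    if isinstance(blob, str):
        blob = blob.encode("utf-8")
    return hashlib.md5(blob).hexdigest().upper() != stored_md5.upper()

db = sqlite3.connect("Qiqqa.library.copy")
# 'LibraryItem', 'fingerprint', 'md5' and 'data' are assumed names.
for key, stored_md5, blob in db.execute(
        "SELECT fingerprint, md5, data FROM LibraryItem"):
    if looks_corrupt(stored_md5, blob):
        # The painful part: Qiqqa reacts to this by DELETING the record.
        print("would be treated as corrupt:", key)
db.close()
```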

Okay, so what if we keep those records intact, i.e. take out the MD5 checksum check? Well, that has crossed my mind as well, and I tried it for a while too. But the corruption that Dropbox manages to induce in SQLite databases is such that I then get spurious crashes and odd(er) behaviour elsewhere in Qiqqa, so that bright idea turns out to be... not so bright.

Is DropBox evil, then?

Well, insert a bit of handwaving here, but at least it works for what it's designed for, and that's file copying. Which is a sequential access pattern, rather than random access.

And that translates to...?

That means we MUST NOT use SQLite to execute the SYNC action onto Dropbox et al; instead, that part of the "sync" action should be a regular "file copy" action, to make sure you're not toasted by Dropbox et al. Which is what the work-around described in the previous message aims for:

  • always get a file copy to produce a local copy of the remote storage.
  • then apply SQLite (or whatever we like or need) to upgrade that local copy to the state we want it to be, i.e. perform a "sync operation".
  • then copy the result back onto DropBox/what-have-you storage, using a sequential access copy action.

For advanced folks, there's rclone and similar tools, which can take care of that "file copy to/from Dropbox" action. Those tools are focused on Linux, but there are Windows versions too. At some point I hope to roll my own dedicated Qiqqa sync tool, based on their technology and with extra built-in checks, for I've run into some other file copy/sync troubles along the way (independent of Qiqqa): it turns out that at least one of my machines sometimes, if only rarely, exhibits a CRC failure when writing data to certain magical combinations of USB disk tray and HDD. So a "file verify-after-copy" is in order, too. ... As if the situation wasn't complicated enough already. 😜
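
To give you an idea, such a helper could be little more than a wrapper around rclone. A sketch (the 'dropbox:' remote name is an assumption; you'd set it up beforehand via `rclone config`):

```python
import subprocess

# 'dropbox:QiqqaSync' assumes a Dropbox remote named 'dropbox' configured
# via `rclone config`; adjust the remote name and paths to your setup.
LOCAL  = r"C:\QiqqaSync"
REMOTE = "dropbox:QiqqaSync"

def push_to_cloud() -> None:
    # Sequential, whole-file transfer to the cloud side...
    subprocess.run(["rclone", "copy", LOCAL, REMOTE], check=True)
    # ...followed by a comparison pass: rclone check compares sizes and,
    # where the backend supports it, checksums between the two sides.
    subprocess.run(["rclone", "check", LOCAL, REMOTE], check=True)

def pull_from_cloud() -> None:
    subprocess.run(["rclone", "copy", REMOTE, LOCAL], check=True)
    subprocess.run(["rclone", "check", REMOTE, LOCAL], check=True)

if __name__ == "__main__":
    push_to_cloud()
```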

Anyway, what we should aim for, technology-wise:

  1. Apply SQLite on local files only.

    Note: This is not a SQLite problem per se, by the way. All other databases suffer from the same issue. The Big Boys have come up with a different approach to the problem we are facing here, though: network-protocol-based access to the database instance. That is off-limits for me & Qiqqa: I want to keep this a desktop application, able to use arbitrary network access, i.e. I hope to be/remain able to provide Qiqqa 'sync' functionality across the least common denominator of networked data sharing. Which is... Dropbox et al. Yeah, painting oneself into a corner there. But that's what we need to do when we want to provide non-tech-savvy users with an application that can "share" a library across multiple machines. 🤷

  2. Apply other "sequential access" software / methods to copy those synced files across the given network barrier, e.g. rclone onto & off Dropbox/GDrive/OneDrive/Mega/...

Alas, the current state of affairs is that we haven't yet produced such a sync system, hence the manual work-around attempt in the previous message. 😢 My apologies....

If we dig into that concept a little deeper, we find several issues that need solving, not all of which are resolved by using 'rclone' as the answer:

  • Given Dropbox et al's finicky behaviour when syncing files, there's a reasonable chance that some sync action will fail in various ways, including contention/collisions due to another user sharing the same files and syncing/updating them around the same time. rclone does not protect you from disaster there, so we need to check the integrity of the files we obtain from the cloud that way. Then we need to decide what to do when we discover such corruption (discard & keep the previous version, or refetch later?).

  • rsync and unison are well-known tools in Linux Land when it comes to synchronizing machines across a network. However, those all assume a higher grade of "least common denominator re network access" than we do. Enter rclone, which addresses the same area of need as we have. (But it lacks a proper file content verify action, so there's still some chance of corruption getting through -- see my note about the USB bay + HDD observation above.)

  • Multi-user Qiqqa is another question that I reckon with too: we need collision resolution when multiple people happen to edit/change records somehow, or rather: when such edits are discovered to collide at sync time. We are not 24/7 networked with Qiqqa (and I don't intend to be: it should work stand-alone, similar to git for software developers), so "simultaneously editing users" can easily become "edited the same data within the same 24-hour window" or suchlike. There are automated collision resolution protocols out there (commercial Qiqqa used such an idea: last one wins), but I find my own metadata is precious enough that I don't accept the side effects of those automatic algorithms, which is: sometimes the algorithm can decide one of the incoming data chunks is discarded, which means you lose an edit permanently. Hence we'll need something like git mergetool style manual (and machine-assisted) merge/collision handling. But that's for later this year....

    See also https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type and note that such machinery potentially throws stuff away. Hence I'm of the opinion that we need a partially-manual process à la git mergetool.

While I digressed into the area of collision resolution, CRDTs and git mergetool-like manual supervision approaches, these are sideways relevant to the issue at hand: a proper Qiqqa SYNC would actually need a sequential copy to/from the network (git does this, and so do rclone et al file copiers), while the local-to-local, mostly-SQLite-based sync should be manually supervisable to resolve "merge conflicts". The extra hair in the soup is that we then still need to make sure our local SYNC does not collide with someone else on the team doing a similar SYNC on their machine during the same time period. So this becomes a kind of iterative sync mechanism, with the additional conclusion that we need "data versioning" (to enable three-way merging tactics when we need merge/sync conflict resolution). Qiqqa currently does not do "versioning": it only keeps the last version of your metadata records, so we're looking at a database layout change as well before this sync function can work as one might hope/expect.
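
To make that "versioning + three-way merge" idea concrete, here's a toy sketch of the decision logic per metadata field (purely illustrative; a real implementation would hand the conflict case to a mergetool-style UI rather than raising):

```python
def merge_field(base, ours, theirs):
    """Classic three-way merge decision for a single metadata field."""
    if ours == theirs:
        return ours        # both sides agree (or neither changed it)
    if ours == base:
        return theirs      # only the other side changed it
    if theirs == base:
        return ours        # only we changed it
    # Both sides changed the same field differently: a genuine conflict.
    # A real Qiqqa sync would hand this to a git-mergetool-style manual
    # resolution step rather than silently picking a winner.
    raise ValueError(f"conflict: {base!r} -> {ours!r} vs {theirs!r}")

# Toy example: each side changed a different field, so both merge cleanly.
base   = {"title": "Old Title", "year": "2001"}
ours   = {"title": "New Title", "year": "2001"}
theirs = {"title": "Old Title", "year": "2002"}
merged = {k: merge_field(base[k], ours[k], theirs[k]) for k in base}
print(merged)  # {'title': 'New Title', 'year': '2002'}
```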

Anyway, that's the whole of it, including the "future music" part, i.e. the team usage discussed in this latter chunk.

Footnotes

  1. I have tried to dig out this problem, but most times when I design or run tests, Dropbox seems to "notice" and quite evilly refuses to fail on me. Yes, I am anthropomorphizing the bugger. Satan is among us, or something crazy like that. Thus far, the collective cloud storage services (Dropbox, Google Drive, MS OneDrive) have proven smarter than yours truly. They only fail when I don't watch for failures, or fail only very rarely when I do, making me spend bloody ages on this. Sigh.

@astaroph
Author

Well, I appear to have really stumbled into it, haven't I! Sorry to poke the wound; it seems that what looked like a minor inconvenience to me has been a true thorn in your side!
Thank you for such a detailed and thorough answer. I have indeed been quietly scratching my head at some of the odd occurrences I have seen pop up with Dropbox, and I can see that it may indeed be worthy of anthropomorphizing. I will try to implement the local sync with manual copy solution you suggest and report back.

However, at the risk of opening up another can of worms (and with none of the hard-won expertise of the relevant programming issues at play), I will also put out an idea I just had, based on what I believe I understand to be the issue (random access by SQLite to a "Dropbox" folder proper). It would seem that Dropbox also has the ability to not just automatically sync its own folders but also automatically "back up" a local folder like Documents or Desktop. I'm going to wildly assume that this "back up" operation is treated differently from whatever black magic they do in their own folders, and hypothesize that if I tell Dropbox to "back up" a local sync folder that I will place in "Documents" or something benign like that, then I will automatically be able to back up from said folder to a similarly benign place on my other machine that can also be watched. It may involve having two different online "backups" hosted by Dropbox, but if it's any easier and doesn't lead to corruption, maybe it's worth it?
More once I've had the chance to test things, and thank you so much for your time, expertise and help!

@astaroph
Copy link
Author

Though I suppose this may easily end up being just as manual as copying and pasting, if it requires manual downloads...

@klmckinney

I believe that within the SQLite database there is just one table. If this is truly the case, is SQLite truly a necessity? Why use a relational database if you have no relations? Is it just being used as a clever storage manager? There might be a better tool and/or data structure for handling the data, one more suited to cloud/network storage.

I am happy to help with this issue but I have to point out that I am mostly baffled by the logic of what Qiqqa is doing!

@GerHobbelt
Collaborator

@astaroph:

It would seem that Dropbox also has the ability to not just automatically sync its own folders but also automatically "back up" a local folder like Documents or Desktop. I'm going to wildly assume that this "back up" operation is treated differently from whatever black magic they do in their own folders, and hypothesize that if I tell Dropbox to "back up" a local sync folder that I will place in "Documents" or something benign like that

Certainly worth a try!

Keep in mind that what I explained is not proven; given my troubles (and apparent failure) getting these problems to show up in a testable/analyzable manner, we're reasoning based on conjecture, so be ready for "surprises".

Side note: it might be handy to copy the Qiqqa.library database to a local backup once in a while: that would at least give us a way to compare databases and discover which metadata (BibTeX) records got lost (a comparison sketch follows below). It'll take some time before I have tools to recover & re-import such raw backup data, but it's certainly on the list, as I have several backups like that myself from years past, thanks to Qiqqa continuously crashing with my (large) library. (That has happened ever since commercial v69; multiple layers of bugs...)
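
A sketch of what such a comparison could look like, using SQLite's ATTACH to diff a backup against the current library (again: 'LibraryItem' and 'fingerprint' are assumed names, see the db-inspection notes; adapt to the real schema):

```python
import sqlite3

# Diff a periodic backup against the current library to find metadata
# records that went missing. Table/column names are assumptions.
db = sqlite3.connect("Qiqqa.library")  # the current (possibly damaged) copy
db.execute("ATTACH DATABASE 'Qiqqa.library.bak' AS backup")

missing = db.execute("""
    SELECT fingerprint FROM backup.LibraryItem
    EXCEPT
    SELECT fingerprint FROM main.LibraryItem
""").fetchall()

for (fingerprint,) in missing:
    print("record lost since backup:", fingerprint)
db.close()
```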

@GerHobbelt
Collaborator

@klmckinney:

I believe that within the SQLite database there is just one table. If this is truly the case, is SQLite truly a necessity? Why use a relational database if you have no relations? Is it just being used as a clever storage manager? There might be a better tool and/or data structure for handling the data, one more suited to cloud/network storage.

I am happy to help with this issue but I have to point out that I am mostly baffled by the logic of what Qiqqa is doing!

Thanks. :-) (We should set up a voice call soon, faster discussion that way.)

re SQLite: yup, 1 table. Jimmy apparently used SQLite as a glorified key-value store. Why, I don't know; I haven't talked to him, so I can only guess. (I have some assumptions, but actual history may be quite different from those guesses of mine.) Just a few comments for now, as I'm way overdue for bed (3:49 AM here; gotta work tomorrow, so ho-hummm):

  • Current use of SQLite is indeed as a key-value store. All sorts of records are dumped in that single table, which is certainly not, 🤔 😇 usual practice when you have a full-fledged SQL engine -- and that was true back when Qiqqa was first created, too. (A sketch of this pattern follows below.)
  • Wrapping bibTeX in JSON in a database column isn't.... obvious, either. 😉
  • SQLite has one advantage: on local storage, it's rock solid with respect to application and all sorts of system crashes, even acute power loss. None of the others offer that, and certainly not back when Qiqqa was conceived, so that makes SQLite a smart pick, no matter how much (or little) SQL/database mad skillz you happen to have. (SQLite is fast, besides, so that's another pro. Sure, LevelDB, LMDB, hamsterdb (one key-value NoSQL store I did use a few years back, around Qiqqa's inception time), etc. are somewhat faster, but none of them were/are significantly faster when you configure them to be safe & sane in the face of application and system failure -- which is what you'd want to pick when you design an app that's collecting & managing costly data. Editing metadata is human effort that comes at great cost.)
  • I will stick with SQLite for the future as (1) I like the little bugger, and (2) I intend to depart from the full-bluntal-nugity approach that is current Qiqqa's raw-bibTeX-wrapped-in-JSON-dumped-in-a-database-column-BLOB storage. 😉 🤧
    • Plus I wish to use SQLite WAL mode; I'm using it in another place and it's nice. I tested LMDB, as I once used TokyoCabinet for raw speed needs elsewhere and ran into LMDB later, but that one is not "cute" enough (speed, etc.) to be picked over SQLite (and its ease of use, thanks to SQL).
    • Did I check other DBs? Yup. Quite a few, actually (in the corona years). LevelDB & friends get hyped a lot, but those are designed for write-mostly usage; they fail to deliver in read-often usage patterns, where tools like SQLite shine. In the end, when things are fast enough, speed isn't everything, and one of the ideas/wishes I have for Qiqqa is that my rebuild of it should open up the opportunity for users to directly access their own library database for meta-research or whatever they fancy; similar to what you are doing, I believe. Anyway, opening up this access for users means some sort of easy access is handy, and SQL is a great boon then: it saves me another bundle of effort to produce an API. That moved me towards MySQL/MariaDB/PostgreSQL (I like the latter most), but then SQLite was still the easiest/fastest to use, I think. But, yes, that's not set in stone; I'd like to keep SQLite, but the "direct user access" means I'll have to allow some sort of parallel access, for which MariaDB & PostgreSQL are more suited out of the box. If I move away from SQLite, it'll be to PostgreSQL, but I have to reckon with Qiqqa users who are very non-IT-tech-savvy, and PostgreSQL is very nice, but SQLite is... more easily installed? In the end, it's personal preference & inertia there, I suppose.
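
For the curious, the storage pattern I'm describing boils down to something like this minimal sketch (the real Qiqqa.library schema differs in its details; see the qiqqa-revengin notes linked earlier):

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect("toy.library")
db.execute("PRAGMA journal_mode=WAL")  # the WAL mode mentioned above

# One single table used as a glorified key-value store: a document
# fingerprint as the key, the bibTeX wrapped in JSON as the value,
# plus the MD5 checksum stored alongside it.
db.execute("""
    CREATE TABLE IF NOT EXISTS LibraryItem (
        fingerprint TEXT PRIMARY KEY,  -- key: document hash
        md5         TEXT,              -- checksum of the blob below
        data        BLOB               -- bibTeX wrapped in JSON
    )
""")

blob = json.dumps({"BibTex": "@article{doe2001, title={An Example}}"}).encode()
db.execute(
    "INSERT OR REPLACE INTO LibraryItem VALUES (?, ?, ?)",
    ("DEADBEEF0123", hashlib.md5(blob).hexdigest().upper(), blob),
)
db.commit()
db.close()
```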

@GerHobbelt
Collaborator

Oh, and always keep in mind: all databases (including the key/value NoSQL ones) are engineered for local disk storage only. None of them have been engineered to sit on top of cloud storage. The way the "big boys" do it is to run multiple instances (phrases such as "sharding" or "partitioning" when you're more into Oracle et al) and demand a solid network setup1, so nothing flaky like consumer-level cloud storage access through arbitrary ISP hardware (NATs, routers2, etc.) and applications you have no real control over.

Meanwhile, Qiqqa wishes to work in just such an unkempt environment.

The short end of this stick: whatever database you pick, SQL or NoSQL, you can bet your bottom dollar you'll be facing similar or just different crap happening out there -- remember, Qiqqa is used by plenty of folks who use it as a tool and are not "into computers for the sake of IT". 😉

-- one of the selling points of "running in the cloud", I guess. Thing is, I don't wanna go there. --

Footnotes

  1. preferably a network with an admin on call, so you'll know it'll be rock-solid as BOFH doesn't want to be woken at night 😉

  2. Just one example tonight: if you've not been facing (Dutch) KPN Nexperia boxes at customer sites, like I have, then you still need to descend one further level into Hell. Those are "routers" that regularly crap out when you put some load through them (torrents are absolutely notorious for it, but no sweat: other somewhat heavy network access can "swamp" these buggers just fine and take out your network connections at random). Way different from your regular corporate router/switch setup, where all those databases, SQL and NoSQL, want themselves to be found. (And, yes, KPN says they now use better ones. Sure, honey. 🤣 )

@wwweagle

A temporary/intermediate solution might be writing a new file (with a new name) every time the SQLite file updates. I got the impression that the SQLite-Dropbox pair creates these headaches because they both tend to work on tiny fractions within a file. If you force them to drop their cleverness, they may actually work. This defeats most of the point of using a DBMS in the first place, but it's like 3 lines of code, and "editing metadata is human effort that comes at great cost" ¯\_(ツ)_/¯
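
Something like this, I mean (paths are made up; the key point is that the cloud client only ever sees whole new files, written once, sequentially):

```python
import shutil
import time
from pathlib import Path

LIBRARY = Path("Qiqqa.library")                        # made-up path
DROPBOX = Path(r"C:\Users\me\Dropbox\QiqqaSnapshots")  # made-up path

def snapshot_library() -> Path:
    """Copy the (closed!) database to a new, uniquely named file, so the
    cloud client only ever sees a whole file written once, sequentially."""
    DROPBOX.mkdir(parents=True, exist_ok=True)
    dst = DROPBOX / f"Qiqqa-{time.strftime('%Y%m%d-%H%M%S')}.library"
    shutil.copy2(LIBRARY, dst)
    return dst

# Run this after Qiqqa has closed the database; on the other machine,
# pick the newest snapshot and copy it to a local folder before syncing.
print("snapshot written to", snapshot_library())
```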

@klmckinney

Hi @GerHobbelt ,
Did you see my direct email, of May 4th, about HEAD not building?
Is it worth opening an issue here to discuss?

@astaroph
Author

Just an update here: I implemented your solution, @GerHobbelt, setting the sync point of each machine to a local folder (in Documents), fixing all of the missing BibTeX and whatnot in the source library, syncing, manually copying over to Dropbox, then updating the local sync point of the second machine with the Dropbox files.
I just synced my second machine with this updated local backup and it found all the missing metadata! It didn't transfer over the Expedition swatches, but I can live with that.
I'll update periodically here if anything seems to be going wrong.
Many thanks for a workable solution, and I look forward to seeing how things develop going forward!
