Git on Windows client corrupts files > 4Gb #2434

Closed
obe1line opened this issue Jul 24, 2017 · 69 comments · Fixed by #2459
Labels
git-core help-wanted The core team would like assistance in implementing this feature. windows

Comments

@obe1line

When cloning a file larger than 4Gbyte from a BitBucket server repository (LFS enabled), the file is not reconstructed correctly. For example, a 6Gb file results in a 700Mb file. The lfs/objects folder contains the correct file, however.

Steps to reproduce:

Server: Basic install of Debian with Bitbucket Server 4.6, Git 2.13 (64-bit)
Client 1: Ubuntu 16.02 64 bit, Git 2.13 and Git-LFS 2.2.1 (both 64-bit)
Client 2: Windows 64 bit (2012), Git 2.13 and Git-LFS 2.2.1 (both 64-bit)

  1. Create a repository on the server with LFS support enabled
  2. Clone the repository (git clone ssh://git@server:7999/tst/test-git.git)
  3. git lfs track '*.iso' , commit and push to the remote
  4. add an iso file larger than 4Gb (I used Visual Studio 2013 Update 4 which is 5.82Gb), commit and push to the remote
  5. Clone the repository into a clean folder

Client 1 works correctly
Client 2 pulls down the file correctly into .git/lfs/objects/aa/6d/aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049 but does not reconstruct/copy it correctly to the destination folder (it results in a 1.82Gb file).

Atlassian have looked into the problem and believe that the BitBucket server is working correctly, due to the fact that the correct content is retrieved over the network into the temporary object file (the CRCs match the original file).

Note that no Git configuration has changed (smudge filters etc are the default).

If the file is removed and "git lfs pull" performed, the file is created correctly. Using "git lfs clone" also works.

@technoweenie
Contributor

First, you can confirm that BitBucket is sending the data correctly by checking the file in .git/lfs/objects. If this checks out, the server is doing things correctly:

$ cd .git/lfs/objects/aa/6d
$ shasum -a 256 aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049
aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049  aa6d2a8e9acbb78895b3d2c6ae3cb0db737344aa82b2859d31f757deec931049

Next, the fact that it works with git lfs clone and git lfs pull means the Git LFS code for copying files works. This ultimately comes down to lfs.LinkOrCopy(), which either makes a hard link (if .git/lfs/objects is on the same partition as your working dir) or manually writes the bytes to the new location.

So, this leaves the git filters. There are two modes that could be causing problems:

  1. The process mode is default in your reported git and lfs versions. The process mode uses a protocol to receive smudge requests from Git via STDIN, and to deliver object contents from LFS via STDOUT.
  2. The older smudge mode is basically a request from Git to get the contents for a single file. It may be worth triggering the smudge filter directly to see if it does the same thing:
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://git-server/your/repo
$ cd repo
$ cat path/to/file
version https://git-lfs.github.com/spec/v1
oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
size 3

$ cat path/to/file | git lfs smudge > smudged-file.bin
$ shasum -a 256 smudged-file.bin
98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4  smudged-file.bin
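To check a result like this without eyeballing it, here is a minimal sketch that pulls the oid and size fields out of a pointer file. The helper names pointer_oid and pointer_size are made up for illustration, and the sketch assumes the three-line pointer layout shown in the sample output above.

```shell
#!/bin/sh
# Sketch: read the "oid" and "size" fields from a Git LFS pointer file.
# pointer_oid and pointer_size are hypothetical helper names.

pointer_oid() {
    # Prints the hex digest from the line "oid sha256:<hex>".
    sed -n 's/^oid sha256:\([0-9a-f]*\)$/\1/p' "$1"
}

pointer_size() {
    # Prints the byte count from the line "size <n>".
    sed -n 's/^size \([0-9][0-9]*\)$/\1/p' "$1"
}

# Example pointer with the same fields as the sample output above.
demo_pointer=$(mktemp)
printf '%s\n' \
    'version https://git-lfs.github.com/spec/v1' \
    'oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4' \
    'size 3' > "$demo_pointer"

echo "oid:  $(pointer_oid "$demo_pointer")"
echo "size: $(pointer_size "$demo_pointer")"
```

The printed size can then be compared with the byte count of the smudged file (for example from wc -c), and the oid with the shasum output.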

Based on that, I think one of the following could be happening:

  1. Git LFS is not sending the correct contents through the filter process. I imagine this would've been caught by now, as it would affect every object.
  2. The file contents aren't being piped through STDOUT correctly. Maybe there are some special characters in your files that are stripped or something?
  3. Git is not receiving the file contents correctly. You can confirm this by running GIT_TRACE_PACKET=1 git clone or something, but the output would be massive for your files :)

Some questions:

  • Are you experiencing this with a specific type of file, or just files over a certain size?
  • Are the corrupt files being truncated? If a 6GB file outputs a 700MB file to your working directory, does it match the first 700MB of the 6GB file?
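One way to answer the truncation question is to compare the short file with the start of the known-good object. This is only a sketch: is_prefix_of is a hypothetical helper, not part of Git or Git LFS, and it relies on head -c, which GNU and BSD systems provide but POSIX does not require.

```shell
#!/bin/sh
# Sketch: report whether one file is a byte-for-byte prefix of another.
# is_prefix_of is a hypothetical helper name.

is_prefix_of() {
    # $1 = possibly truncated file, $2 = known-good original
    # The arithmetic expansion strips any padding that wc adds.
    short_size=$(( $(wc -c < "$1") ))
    head -c "$short_size" "$2" | cmp -s - "$1"
}

# Tiny stand-in files; in practice $1 would be the working-tree file and
# $2 the object under .git/lfs/objects.
original=$(mktemp)
truncated=$(mktemp)
printf 'abcdefghij' > "$original"
printf 'abcd' > "$truncated"

if is_prefix_of "$truncated" "$original"; then
    echo "truncated copy matches the start of the original"
else
    echo "contents differ, so this is not a plain truncation"
fi
```

If the check passes, the data that did arrive is intact and only the length is wrong.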

It'd be really helpful if we could get a sample file that exhibits this behavior. I imagine that's a no-go, so we may have to come up with a special build of LFS with special tracing powers. @ttaylorr, any thoughts? Did I miss any debugging questions or trial commands to run?

@ttaylorr
Contributor

@technoweenie that looks pretty comprehensive. My hunch is that it's related to one of the three issues you described as being process filter-related.

@obe1line do you have a copy of the file or repository that you could share? I think that would be the easiest way for me to debug this going forward.

@technoweenie
Contributor

Hey @obe1line, I ran this by a Git core dev, and he mentioned that Git on Windows does not support files over 4GB. Unfortunately, this is not something we can fix in LFS. The best we can do now is add a warning when large objects are added.

As a workaround, I think you should disable smudging completely:

$ git lfs install --skip-smudge
$ git lfs env
... snip
git config filter.lfs.process = "git-lfs filter-process --skip"
git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"

After that, you'll have to run git lfs pull any time you change branches or fetch updates from your remote.
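With smudging skipped, files in the working tree stay as small pointer files until git lfs pull replaces them. A rough way to spot files that are still pointers is to look for the spec version line at the top. The helper name is_lfs_pointer is hypothetical, and the check assumes the pointer layout shown earlier in this thread.

```shell
#!/bin/sh
# Sketch: tell a Git LFS pointer file apart from real content.
# is_lfs_pointer is a hypothetical helper name.

is_lfs_pointer() {
    # Only the first 64 bytes are read, so large binaries are cheap to test.
    head -c 64 "$1" | grep -q -x -F 'version https://git-lfs.github.com/spec/v1'
}

# Stand-in files for the demo.
pointer_file=$(mktemp)
real_file=$(mktemp)
printf '%s\n' \
    'version https://git-lfs.github.com/spec/v1' \
    'oid sha256:98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4' \
    'size 3' > "$pointer_file"
printf 'hi\n' > "$real_file"

for f in "$pointer_file" "$real_file"; do
    if is_lfs_pointer "$f"; then
        echo "$f: still a pointer, run git lfs pull"
    else
        echo "$f: real content"
    fi
done
```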

@ttaylorr I think we should add an early warning in the filter smudge and process code, perhaps linking to a page offering this workaround.

@obe1line
Author

@technoweenie Thanks for the detailed comment.
In answer to your questions:

  • All files over 4Gb are affected, not just specific types
  • Yes, files are truncated with the 700Mb matching the original file
  • Copying the file from .git/lfs/objects/ and renaming matches the original file

I used the GIT_TRACE_PACKET, GIT_TRACE and GIT_CURL_VERBOSE to produce output previously and yes, it is rather large (~12Gb from memory).

@ttaylorr It can be reproduced with any file >4Gb, nothing special about the repository. The file is pushed with the Linux client, and fetched with the Windows client - I assume if I had used Windows to push, then the file may not have transferred fully into the repository.

@technoweenie Is the workaround only applicable to fetch? i.e. would uploading a file via "git lfs push" work correctly even with the Git 4Gb limit?

@technoweenie
Contributor

Is the workaround only applicable to fetch? i.e. would uploading a file via "git lfs push" work correctly even with the Git 4Gb limit?

It probably won't work on Windows. Added files are passed through the LFS clean filter (basically the reverse of the smudge filter), so they are probably subject to the same size limitation in Git.

@ndebard

ndebard commented Mar 1, 2018

I'm seeing this on Ubuntu that is running on a Windows machine but is rebooted with Grub. This may have something to do with the partition rather than windows itself? See attached photo.
[Attached photo: 20180301_021229]

@ttaylorr
Contributor

ttaylorr commented Mar 3, 2018

Hi @ndebard -- thanks for commenting here. I think what you're experiencing is the correct behavior, even though the output can be a bit confusing. This message appears when a file greater than 4.0 GB is copied into your working tree (it looks like data/dump_006_ls_0.dat is 9.0 GB). The message is there to warn that checking out that same file on Windows may cause issues with certain versions of Git.

Though the message isn't pertinent to you on Ubuntu (?), it will have relevancy for any colleagues of yours using Windows.

@dwall17

dwall17 commented Mar 7, 2018

Hi @ttaylorr
I'm very new to using Git so I just want to gain further clarification on your response here. If I'm correct, what you're saying is that this "malformed smudge" output isn't an error message, but instead a warning to Windows users pertaining to their version of Git? Considering that I'm using Ubuntu, would it be safe to simply ignore the message?

@ttaylorr
Contributor

Considering that I'm using Ubuntu, would it be safe to simply ignore the message?

It is safe to ignore the message for yourself -- since you're not on Windows, your repository should work as expected even though it contains large files in the working copy. The warning is to remind you that a checkout of your repository may not work on a Windows machine.

@luckydonald

luckydonald commented Apr 4, 2018

Could someone please clarify what this smudge filter does, and how turning it off makes Windows able to store bigger files?
Or is Git LFS not able to process files over 4 GB at all? Would plain Git be able to store those files?

@ttaylorr
Contributor

ttaylorr commented Apr 5, 2018

Could someone please clarify what this smudge filter does

Hi @luckydonald, thanks for asking! The smudge filter is applied to transform content from your index into your working copy. In practice, this means that we take the small reference (pointer) that LFS actually stores in your Git repository, and transparently convert it into a large file on your disk, so that it appears as if the large file itself is present in the repository.

and how turning it off makes windows able to store bigger files?

I don't think that turning off the smudge filter would make Windows able to store bigger files. The issue is rather that Git has a limitation on Windows of not being able to correctly smudge files when the size of the outgoing content is larger than 4GB.

This isn't an inherent limit of the file system, rather an implementation detail of Git.

@shabbyrobe

So there's no viable workaround for this? Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?

@ttaylorr
Contributor

Basically, if you have 4GB files in your repo, you can clone it if you disable the smudge filter, but you can't commit or push?

Not quite. The issue is with >4 GiB files from any source; their coming from Git LFS is only one half of the problem. If another filter puts them there, or that's how they're stored in your Git repository, then it is not guaranteed that they will be checked out correctly by Git on Windows.

With regards to the Git LFS-part of that problem, if you have a >4 GiB LFS object (read: not a Git object, but an LFS one), you can avoid introducing that into your local copy by passing --exclude=path/to/file (or lfs.fetchexclude=path/to/file in your .gitconfig). With either (or both) of these options passed, Git LFS will not download or check out the large file into your working copy, thus side-stepping the problem.

One thing that I think is important to remember, is that this issue does not cause problems on Unix, macOS, or other platforms that don't have the >4 GiB file-size limitation. So, if you have a >4 GiB file in your repository (LFS or otherwise), it should work fine on platforms other than Windows. If you're on Windows, we are stuck with this behavior, so --exclude is the best way forward, IMHO.
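As a rough illustration of the exclude approach, the sketch below turns a list of sizes and paths into a comma-separated value for lfs.fetchexclude. The "<bytes> <path>" input format, the helper name paths_over_4gib, and the example paths are all assumptions made for this sketch; the sizes could come from the size field of each pointer file. It prints the git config command rather than running it, and it would not cope with paths that contain commas.

```shell
#!/bin/sh
# Sketch: collect paths of objects that are 4 GiB or larger.
# paths_over_4gib is a hypothetical helper name.

paths_over_4gib() {
    # Reads "<bytes> <path>" lines on stdin and prints the matching
    # paths as one comma-separated line.
    awk '$1 + 0 >= 4294967296 {
        sub(/^[0-9]+[ \t]+/, "")
        printf "%s%s", sep, $0
        sep = ","
    }
    END { if (sep != "") printf "\n" }'
}

# Hypothetical listing: one large ISO and one small text file.
listing='6249177416 images/vs2013.iso
3 hi.txt'

exclude_list=$(printf '%s\n' "$listing" | paths_over_4gib)
echo "git config lfs.fetchexclude \"$exclude_list\""
```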

@chowey

chowey commented Nov 3, 2018

For the record, I have been using Git LFS on Windows by disabling the smudge filter and the process filter. Files >4GB seem to work fine. It just means you need to manually git lfs pull any time you pull, switch branches, or clone a repo.

@JohnFrampton

First: my colleagues and I encounter this problem on Windows 10 with the newest NTFS filesystem, which is without any doubt capable of handling files > 4 GB and even files of up to 16 exabytes (see here http://www.ntfs.com/ntfs_vs_fat.htm).

Second: the newest git (including lfs feature) is great.

Third: Thanks for all those proposals for avoiding the problem, but in the end we don't want to avoid using/downloading/cloning/pulling files > 4 GB.
We want git (lfs) to handle them as lfs files correctly.

Is there a plan and timeframe to fix that problem on windows?

@bk2204
Member

bk2204 commented Nov 6, 2018

So as I understand this issue, it's due to Git on Windows not supporting files greater than 4 GB properly. The issue is that the smudge and clean filters are invoked by Git, and Git itself doesn't handle this gracefully. Git LFS does handle this gracefully, but because it's invoked by Git (unless you disable the filters), the data is corrupted before it makes it to Git LFS.

To explain the issue with Git, it's because the Git codebase uses unsigned long for certain values. On a 64-bit Unix system (including Linux and macOS), that type is 64 bits in length, and unsigned long is the canonical way to write a system-sized unsigned word type. However, on Windows, unsigned long is always 32 bits. Consequently, even a 64-bit Git on Windows doesn't handle large files. There's an explanation of this issue in a thread on the Git list.
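To make the effect concrete, here is a small sketch of what happens to a byte count that is squeezed into 32 bits. It assumes the size is simply reduced modulo 2^32, which is consistent with the 5.82Gb file arriving as 1.82Gb in the original report, but it is not a trace of what Git actually does internally. The helper name and the exact byte count are made up for illustration, and the shell is assumed to use 64-bit arithmetic.

```shell
#!/bin/sh
# Sketch: a byte count as seen through a 32-bit unsigned value.
# truncate_to_32_bits is a hypothetical helper name.

truncate_to_32_bits() {
    # Keep only the low 32 bits, i.e. the size modulo 2^32.
    echo $(( $1 & 0xFFFFFFFF ))
}

# Hypothetical size of roughly 5.82 GiB, close to the ISO in the report.
full_size=6249177416
seen_size=$(truncate_to_32_bits "$full_size")

echo "real size:   $full_size bytes"
echo "32-bit view: $seen_size bytes"
```

The second number comes out at about 1.82 GiB, which is the real size minus 4 GiB.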

Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the mean time, there isn't anything we as Git LFS developers can do to fix it.

@chowey

chowey commented Nov 6, 2018

@bk2204 That sounds correct to me. I just wanted to reiterate that the workaround from @technoweenie does work:

As a workaround, I think you should disable smudging completely:

$ git lfs install --skip-smudge
$ git lfs env
... snip
git config filter.lfs.process = "git-lfs filter-process --skip"
git config filter.lfs.smudge = "git-lfs smudge --skip -- %f"

After that, you'll have to run git lfs pull any time you change branches or fetch updates from your remote.

So it is possible to still work with Git LFS on Windows for large files, but you must disable smudge or your files will get corrupted.

@JohnFrampton

Git for Windows is already tracking this issue as git-for-windows/git#1063. The good news is that when this is fixed in Git, everything should automatically work with any version of Git LFS. In the mean time, there isn't anything we as Git LFS developers can do to fix it.

Thank you very much to clarify the situation. I think it was necessary to state quite clear what we "Big File Users" are waiting for :-)

@aggieNick02

@ttaylorr @technoweenie Unless I'm mistaken, the workaround of disabling smudge only sort of works. If you have a nice big new file upstream, and you git pull/git lfs pull, everything is great. But then if you run git checkout HEAD~1 followed by git checkout master, the checkout of master will fail, as it will try to smudge the big file because it is already present in .git/lfs/objects.

To recover from this, you basically need to nuke your .git/lfs directory, as the presence of the file in the local lfs objects folder means smudge runs for it even though the skip options are set.

One important caveat: I'm on an older git/git-lfs version. I'll upgrade tomorrow, but just based on comparing the old and current source of git-lfs, I don't expect different behavior. If what I'm talking about sounds wacky/not the behavior you currently expect, perhaps it has already been fixed.

@tardyp

tardyp commented Feb 1, 2019

Hey. This issue should not be closed as the windows client still corrupts files > 4GB.
This is just depending on another bug in git.

@ttaylorr
Contributor

ttaylorr commented Feb 5, 2019

This issue should not be closed as the windows client still corrupts files > 4GB.

If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.

@shabbyrobe

The issue as described in the Git for Windows bug mentioned earlier in this thread (git-for-windows/git#1063) points to a problem with incorrect datatypes in the Git code, not with Windows filesystem limitations:

The problem is the Git source code, which uses unsigned long in places where size_t would be correct.

@ttaylorr ttaylorr reopened this Feb 5, 2019
@bk2204
Member

bk2204 commented Feb 5, 2019

If I recall correctly, this is not related to a bug in Git, but rather is an inherent limitation of the Windows filesystem.

The limitation on files larger than 4 GiB is for FAT, but not NTFS. NTFS is capable of large files, but you're correct that if you're using a flash drive for your LFS-using Git repository, then you probably have a file system limitation. I think that most Windows users are using NTFS for their systems, though.

I believe in this case the issue is as @shabbyrobe quoted: we use unsigned long, which is 32 bit on 64-bit Windows, instead of size_t, which is 64 bit. While Unix systems have unsigned long equal to the pointer size (they are LP64 systems), Windows always has a 32-bit long (the LLP64 model).

There are a handful of patches going into Git 2.21 to address some of these issues, although they may not be complete, so it may be useful to keep an eye out for improvements in that regard.

@aggieNick02

@PhilipOakley, unfortunately yes, the 4Gb limit is still an issue,

I first want to find an automated workaround.

I'm not sure why the post-merge hook isn't running.

Also note that even if it does run, there may be other scenarios (adding a new lfs file, doing a checkout of an older commit hash which has a previous version of a large file) that may still cause problems and not be solved with a post-merge hook. At minimum those scenarios should at least be checked before rolling an automated workaround out to developers with the hopes that the devs won't have to worry about the >4GB issue.

@marbx

marbx commented Mar 2, 2021

@aggieNick02, the post-merge hook seems to be omitted when git pull performs fast-forward. If I set merge.ff false, the post-merge hook runs, but only at the cost of a merge and a merge message.
In your two scenarios: why would a manually executed git-lfs pull create no problems?

@doctorpangloss

I just really want to know what we're supposed to do to use 4GB+ files in Git.

@doctorpangloss

Can git-lfs just break up files larger than the 4GB limit into smaller ones, and reassemble them in its smudge or whatever?

@bk2204
Member

bk2204 commented Aug 23, 2021

I just really want to know what we're supposed to do to use 4GB+ files in Git.

Well, if you're using Git LFS, you can either use --skip-smudge as mentioned above, or you can use WSL to check out the files. This limitation in Git only applies to Windows versions, not Unix versions.

Can git-lfs just break up files larger than the 4GB limit into smaller ones, and reassemble them in its smudge or whatever?

No, because the problem isn't Git LFS. The problem is that Git itself will truncate these files, so Git LFS reassembling them will still result in Git truncating them.

@doctorpangloss

Well, if you're using Git LFS, you can either use --skip-smudge as mentioned above, or you can use WSL to check out the files. This limitation in Git only applies to Windows versions, not Unix versions.

Thank you for the follow up. git lfs install --skip-smudge doesn't change any behavior for me, in a newly cloned or existing repo. I still receive an error message. On the flip side, my files do not appear to be truncated either way, so this is all very puzzling.

@aggieNick02

aggieNick02 commented Aug 28, 2021

Thank you for the follow up. git lfs install --skip-smudge doesn't change any behavior for me, in a newly cloned or existing repo. I still receive an error message. On the flip side, my files do not appear to be truncated either way, so this is all very puzzling.

Even when using the workarounds, you will still get warnings from Git LFS about the possible problem. But it typically doesn't actually happen with the workarounds.

@aggieNick02

@aggieNick02, the post-merge hook seems to be omitted when git pull performs fast-forward. If I set merge.ff false, the post-merge hook runs, but only at the cost of a merge and a merge message.
In your two scenarios: why would a manually executed git-lfs pull create no problems?

@marbx Apologies for not responding months ago - saw and thought about your question with the recent activity.
I don't think a manual git-lfs pull would behave any differently than one done automatically via a hook. But I've run into scenarios (like pushing after adding and committing an lfs file, or possibly checking out an older commit) where having the smudge and process filter disabled will put your client in a bad state. There are ways to recover, but the workaround isn't foolproof. Despite that, I've successfully been using Git LFS on Windows for at least a couple of years with >4GB files, running my own LFS server.

@marbx

marbx commented Aug 28, 2021

Hi @aggieNick02 , thank you for your reply. For me, sadly, foolproof is mandatory.

I recommended to stop using skip-smudge after it corrupted a ps1 file, maybe because Windows (Server 2012R2) assumed UTF16. Even before, skip-smudge was perceived as process risk.

Until someone (or a group) fixes (or obsoletes) the "Git wrongly assumes that long in C has always 64bit" root cause, 4GB is unavailable for us.

@PhilipOakley

Until someone (or a group) fixes

If you (or your team, or where you work) are/is in any way able to help then that would be useful.

Even the project scope isn't sufficiently fleshed out to identify a sub-MVP (smallest demonstrable progress - SDP?). I have lots of personal notes, but it's quite a big job.

@marbx

marbx commented Aug 29, 2021

Hi @PhilipOakley , I guess the team "at hand" is more suitable than that "at work".

I believe I read your rejected PR in which you replaced long with long long.

I tried to identify a smaller change set by running a debugger with breakpoints on any long, but I didn't find a suitable command to debug. What do you think of that?

Do you have a note about a planned "direct file to file" method to copy files from lfs to git? I only have a vague memory.

@PhilipOakley

Hi @marbx,

My 'at work' aside, was for those readers who might have a work place that would benefit from a >4GB resolution who could maybe get their local management to let them do a few hours on company time, on a win-win basis. Often there is more flexibility than one may imagine (companies cost things funnily;-)

Which PR were you looking at (so we are talking about the same one)? The idea, as per the C89/99 standard, is to use size_t as the memsize type (rather than long long) so that it is portable between LLP64 and LP64 systems which have a different size for the int and long types.

I don't have any notes on '"direct file to file" method to copy files from lfs to git'. I had been just focussed on the internals of Git/Git-for-Windows.

Having been a Systems Engineer I tend to work from outside in (big picture), while git patching tends to work from inside to out (fine details picture). So I tend to want to be able to know when the job is complete, rather than small items started.
I hope to show a draft of notes soon once I've added a bit of structure.

One thought of an approachable activity is to look at commands which should NOT involve the 4GB limit and check that it's actually true, or just annotate(split) the list of commands into that same two groups (e.g. shouldn't, maybe, and probably, for 4GB testing - yes that's three, but it should be just two ;-)

@marbx

marbx commented Aug 31, 2021

Yes, I meant your PR with size_t

@marbx

marbx commented Aug 31, 2021

Has anybody else a memory about a direct file to file transfer?

@KalleOlaviNiemitalo

Perhaps it refers to git lfs checkout? That would not rely on the smudge filter, I think.

@PhilipOakley

Yes, I meant your PR with size_t

So that's git-for-windows/git#2179
I was just making sure we were on the same page.

@doctorpangloss

doctorpangloss commented Sep 1, 2021

Hi @marbx,

My 'at work' aside, was for those readers who might have a work place that would benefit from a >4GB resolution who could maybe get their local management to let them do a few hours on company time, on a win-win basis. Often there is more flexibility than one may imagine (companies cost things funnily;-)

Which PR were you looking at (so we are talking about the same one)? The idea, as per the C89/99 standard, is to use size_t as the memsize type (rather than long long) so that it is portable between LLP64 and LP64 systems which have a different size for the int and long types.

I don't have any notes on '"direct file to file" method to copy files from lfs to git'. I had been just focussed on the internals of Git/Git-for-Windows.

Having been a Systems Engineer I tend to work from outside in (big picture), while git patching tends to work from inside to out (fine details picture). So I tend to want to be able to know when the job is complete, rather than small items started.
I hope to show a draft of notes soon once I've added a bit of structure.

One thought of an approachable activity is to look at commands which should NOT involve the 4GB limit and check that it's actually true, or just annotate(split) the list of commands into that same two groups (e.g. shouldn't, maybe, and probably, for 4GB testing - yes that's three, but it should be just two ;-)

I'll happily fix this stuff. Since I live in the here and now I would probably just merge your patch if it passes tests, and otherwise fix the issues the tests show. Based on my cursory reading it sounds like there are some style issues and the patch splitting thing. You don't have to repeat / rehearse for me though.

Git should work with large files on Windows. There seems to be a belief that this isn't a big deal, but IMO this is a hair-on-fire problem with Git for Windows. A bajillion things use source control and LFS on large files together, like game development and static website content.

My team installs the git for windows SDK for a good development environment. If it's as easy as a git pull somewhere...

@PhilipOakley

I'll happily fix this stuff. Since I live in the here and now I would probably just merge your patch if it passes tests, and otherwise fix the issues the tests show.

That would be a useful step. Even if you just pull out the zlib changes and (using the git CI) confirm that regular git still works OK, allowing those parts to be [ready for] upstream early.

You don't have to repeat / rehearse for me though.

That's excellent.

some style issues and the patch splitting thing.

Yep, that's part of the long-run issue. But if you start somewhere it is still valuable.

@PhilipOakley

@doctorpangloss

One command that may be worth starting with, as it has a core element to it, is the git hash-object command.

Pick the most simple of options to limit the areas of the code base that it could explore (e.g. tweak the BigFileThreshold so that you don't have to hit the pack file code). This should at least avoid the combinatorial explosion of interdependencies.

@mastercoms

Is this fixed in Git 2.34?

@bk2204
Member

bk2204 commented Nov 16, 2021

It is fixed in Git for Windows 2.34, but not in Git 2.34. The patch was specifically applied to Git for Windows, but has not been released in upstream Git yet.

@mastercoms

Oh I thought that this was a Windows only bug, sorry! Cheers

@bk2204
Member

bk2204 commented Nov 18, 2021

It is indeed a Windows-only bug, but in the event you're building your own Git on Windows from the Git source, not the Git for Windows source, then there's a difference. Probably 99% of users on Windows use Git for Windows, so for them, this issue will be fixed in 2.34.

@bk2204
Member

bk2204 commented Jan 7, 2022

Since this is now fixed upstream, I'm closing this issue.
