Comparing changes

base repository: israel-lugo/capidup
base: v1.0.1
head repository: israel-lugo/capidup
compare: master
Commits on Jul 14, 2016

  1. Create unit test for issue #12.

    New file test_indexing.py, to test index_files_by_size(). New test
    test_nonexistent, to test indexing nonexistent files.
    israel-lugo committed Jul 14, 2016
    1b6663c

Commits on Jul 15, 2016

  1. Fix breakage caused by race condition. Closes #12.

    * finddups.py (index_files_by_size): Fix race condition when file is
    deleted between os.walk() seeing it and us calling os.lstat().
    * tests/test_indexing.py (test_nonexistent): Remove "expected to fail"
    mark.
    israel-lugo committed Jul 15, 2016
    ab4437d
  2. Merge branch 'index_files_by_size-race'.

    * index_files_by_size-race:
      Fix breakage caused by race condition. Closes #12.
      Create unit test for issue #12.
    israel-lugo committed Jul 15, 2016
    6a9d98e
  3. 79d51d2
  4. 495a2cf
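The race-condition fix in ab4437d amounts to tolerating files that vanish between the directory listing and the stat call. A minimal standalone sketch of that pattern (the real change lands in capidup/finddups.py, shown in the diff further down):

    import os
    import stat

    def sizes_under(root):
        """Map file size -> list of paths, surviving files deleted mid-walk."""
        by_size = {}
        for curr_dir, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(curr_dir, name)
                try:
                    info = os.lstat(path)
                except OSError:
                    # the file was deleted between os.walk() listing it
                    # and our os.lstat() call; skip it instead of crashing
                    continue
                if stat.S_ISREG(info.st_mode):
                    by_size.setdefault(info.st_size, []).append(path)
        return by_size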

Commits on Jul 17, 2016

  1. Add CHANGELOG.rst file.

    israel-lugo committed Jul 17, 2016
    1901e47
  2. d4bd235
  3. b96185c

Commits on Aug 1, 2016

  1. 53d4652
  2. test_dups: Fix breakage on Python 3.2. Encode strings in UTF-8. See #9.

    Python 3.2 doesn't support u"" prefix on Unicode strings. We have to use
    a transparent wrapper for compatibility.
    israel-lugo committed Aug 1, 2016
    8dc11fd
  3. fc3ee64
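The "transparent wrapper" that the 8dc11fd message refers to lives in capidup's py3compat module, whose contents are not shown in this compare; the u() helper below is therefore only an assumed sketch of the usual shape of such a wrapper:

    import sys

    if sys.version_info[0] >= 3:
        def u(s):
            # str literals are already Unicode on Python 3
            return s
    else:
        def u(s):
            # Python 2: interpret \uXXXX escapes in the byte literal
            return s.decode('unicode_escape')

    # same text on Python 2.x and on 3.2, where u"na\u00efve" is a syntax error
    name = u("na\u00efve file.txt")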

Commits on Aug 6, 2016

  1. 9c5fad7

Commits on Sep 5, 2016

  1. 55bae63
  2. test_dups: Test with subdirs as well as flat.

    * test_dups.py (test_flat_find_dups_in_dirs): Renamed to
    test_find_dups_in_dirs. New argument flat.
    (setup_flat_files): Renamed to setup_files. New argument flat.
    israel-lugo committed Sep 5, 2016
    b415c70

Commits on Sep 9, 2016

  1. 9881afb
  2. find_duplicates_in_dirs: Fix documentation example.

    Wrong function name in example. Relative paths suggest 'dir1' and 'dir2' are
    subdirectories of '.', which is already in the list of dirs to scan.
    israel-lugo committed Sep 9, 2016
    2cc6b90
  3. Implement regression tests for directory exclusion. See #10.

    Actual functionality is still not implemented.
    
    New file tests/test_excludes.py.
    israel-lugo committed Sep 9, 2016
    ca89609
  4. 034997d
  5. find_duplicates_in_dirs: New parameter exclude_dirs. See #10.

    Add hook to index_files_by_size to prune subdirs.
    
    Still need to implement the actual pruning.
    israel-lugo committed Sep 9, 2016
    761a0e0
  6. Implement subdir pruning.

    Enabled the unit tests in test_excludes, as this is now implemented.
    israel-lugo committed Sep 9, 2016
    4884a32
  7. a0c08a8
  8. Merge branch 'exclude-dirs'. Closes #10.

    * exclude-dirs:
      CHANGELOG.rst: Document new exclude-dirs feature.
      Implement subdir pruning.
      find_duplicates_in_dirs: New parameter exclude_dirs. See #10.
      test_excludes: Mark xfail, not implemented yet.
      Implement regression tests for directory exclusion. See #10.
    israel-lugo committed Sep 9, 2016
    18e107a
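The pruning merged in 18e107a leans on a documented property of os.walk: with topdown=True, assigning to the subdirs list in place controls which directories are descended into. A condensed sketch of the idea, mirroring the should_be_excluded/prune_names pair in the finddups.py diff below:

    import fnmatch
    import os

    def should_be_excluded(name, exclude_patterns):
        """True if name matches any glob pattern in exclude_patterns."""
        return any(fnmatch.fnmatch(name, pat) for pat in exclude_patterns)

    def walk_pruned(root, exclude_dirs):
        for curr_dir, subdirs, filenames in os.walk(root, topdown=True):
            # in-place slice assignment is what makes os.walk skip them
            subdirs[:] = [d for d in subdirs
                          if not should_be_excluded(d, exclude_dirs)]
            yield curr_dir, filenames

    # e.g. never descend into directories named "tmp" or hidden ones:
    for curr_dir, filenames in walk_pruned('.', ['tmp', '.*']):
        pass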

Commits on Sep 10, 2016

  1. 2c3c73e
  2. 884da9f
  3. Split huge test_dups.py into separate files.

    * test_dups.py: Renamed to test_dups_full.py, some tests removed.
    * test_dups_simple.py: New file, created from some of those tests.
    israel-lugo committed Sep 10, 2016
    24b63cf
  4. 5b99484
  5. find_duplicates_in_dirs: improve docstring example.

    Also improve explanation of exclude_dirs for index_files_by_size.
    israel-lugo committed Sep 10, 2016
    54c579b
  6. Implement unit tests for file exclusion. See #15.

    * test_excludes.py (test_exclude_files): New test.
    (exclude_dirs_data): Renamed to exclude_data. Added extension tests.
    (reference_file): New global variable.
    * finddups.py (find_duplicates_in_dirs): New parameter exclude_files.
    (index_files_by_size): Likewise.
    israel-lugo committed Sep 10, 2016
    096bfe3
  7. Implement file exclusion. See #15.

    * finddups.py (index_files_by_size): Prune filenames.
    (prune_subdirs): Renamed to prune_names. Made generic,
    doesn't care if the name is a subdir or a filename.
    (should_be_excluded): Made generic, doesn't care if the name is a subdir
    or a filename.
    * test_excludes.py (test_exclude_files): Unmark xfail, is now
    implemented.
    * CHANGELOG.rst: Document new feature.
    israel-lugo committed Sep 10, 2016
    4c55220
  8. Merge branch 'exclude-files'. Closes #15.

    * exclude-files:
      Implement file exclusion. See #15.
      Implement unit tests for file exclusion. See #15.
    israel-lugo committed Sep 10, 2016
    0380a38
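With exclude-files merged, the public entry point accepts both exclusion lists. A usage sketch based on the signature added in this compare (the scanned paths are illustrative):

    from capidup.finddups import find_duplicates_in_dirs

    # skip any "tmp" directory and any "*.bak" file while scanning
    duplicate_groups, errors = find_duplicates_in_dirs(
        ['/srv/photos', '/srv/backup'],   # hypothetical directories
        exclude_dirs=['tmp'],
        exclude_files=['*.bak'])

    for group in duplicate_groups:
        print("identical: " + ", ".join(group))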

Commits on Sep 15, 2016

  1. 4c43281

Commits on Sep 20, 2016

  1. 48e4f2f

Commits on Jan 24, 2017

  1. Remove unused import.

    israel-lugo committed Jan 24, 2017
    95001dc

Commits on Feb 6, 2017

  1. 65fe82a
  2. 1297b49
  3. 11477e1
  4. Release version 1.1.0.

    israel-lugo committed Feb 6, 2017
    bdd7627

Commits on Sep 10, 2017

  1. 09d3b0a
  2. 1636177

Commits on Sep 14, 2017

  1. Fix false negatives in test_find_dups_in_dirs. Closes #19.

    We were relying on the visitation order of the files within a group.
    israel-lugo committed Sep 14, 2017
    37572d5
  2. Fix false positives in loop detection. See #17.

    We were detecting loops related to symlinks even when follow_dirlinks
    was false, i.e. when the symlinks would be irrelevant.
    israel-lugo committed Sep 14, 2017
    63c6d9c

Commits on Sep 15, 2017

  1. Fix false positives in loop detection. Closes #17.

    We had false positives when there was a (forward) symlink to a subdir.
    Fix filter_visited so it doesn't count multiple subdirs pointing to the
    same thing as a loop.
    israel-lugo committed Sep 15, 2017
    817cb87
  2. Fix loop detection to catch loops on first pass.

    Mark the current directory as visited before inspecting the subdirs.
    This lets us catch symlinks to it immediately, instead of after entering
    one level of loop. See #17.
    israel-lugo committed Sep 15, 2017
    7524d04
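Both loop-detection fixes hinge on two details visible in the diff below: identify a directory by its (st_dev, st_ino) pair rather than by path, and mark the current directory as visited before inspecting its subdirs, so a symlink back to it is caught on the first pass. A condensed sketch of the check performed by filter_visited:

    import os

    def filter_visited_sketch(curr_dir, subdirs, already_visited):
        """Return (subdirs still to visit, updated visited set)."""
        visited = set(already_visited)
        # mark curr_dir first: a symlink pointing back at it is then
        # caught immediately, not after entering one level of the loop
        info = os.lstat(curr_dir)
        visited.add((info.st_dev, info.st_ino))

        filtered, to_visit = [], set()
        for subdir in subdirs:
            info = os.lstat(os.path.join(curr_dir, subdir))
            key = (info.st_dev, info.st_ino)
            if key not in visited:
                # to_visit is kept separate from visited, so sibling
                # subdirs aliasing the same target are not miscounted
                # as loops
                filtered.append(subdir)
                to_visit.add(key)
            else:
                print("directory loop detected: " + subdir)
        return filtered, visited | to_visit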
Showing 9 changed files with 610 additions and 39 deletions.
  1. +78 −0 CHANGELOG.rst
  2. +5 −2 README.rst
  3. +142 −11 capidup/finddups.py
  4. +36 −24 capidup/tests/{test_dups.py → test_dups_full.py}
  5. +93 −0 capidup/tests/test_dups_simple.py
  6. +173 −0 capidup/tests/test_excludes.py
  7. +82 −0 capidup/tests/test_indexing.py
  8. +1 −1 capidup/version.py
  9. +0 −1 setup.py
78 changes: 78 additions & 0 deletions CHANGELOG.rst
@@ -0,0 +1,78 @@
CapiDup Change Log
==================

All releases and notable changes will be described here.

CapiDup adheres to `semantic versioning <http://semver.org>`_. In short, this
means the version numbers follow a three-part scheme: *major version*, *minor
version* and *patch number*.

The *major version* is incremented for releases that break compatibility, such
as removing or altering existing functionality. The *minor version* is
incremented for releases that add new visible features, but are still backwards
compatible. The *patch number* is incremented for minor changes such as bug
fixes, that don't change the public interface.


Unreleased__
------------
__ https://github.com/israel-lugo/capidup/compare/v1.1.0...HEAD


Changed
.......

- Detect directory loops while crawling. Can happen e.g. if `follow_dirlinks`
  is True and there are symlinks pointing to parent directories. See
  `issue #17`_.

Fixed
.....

- Fix occasional false negatives in one of the unit tests. See `issue #19`_.


1.1.0_ — 2017-02-06
-------------------

Added
.....

- `find_duplicates_in_dirs` can now exclude directories, through a new optional
  parameter `exclude_dirs`. See `issue #10`_.

- `find_duplicates_in_dirs` can now exclude files, through a new optional
  parameter `exclude_files`. See `issue #15`_.

- `find_duplicates_in_dirs` can now follow symbolic links to subdirectories,
  through a new optional parameter `follow_dirlinks`. See `issue #16`_.

Changed
.......

- Implement unit tests for indexing files with non-ASCII filenames, to make
  sure it works. See `issue #9`_.

Fixed
.....

- Fix possible breakage when files are deleted while they are being indexed.
  See `issue #12`_.


1.0.1_ — 2016-07-12
-------------------

First production release.


.. _issue #9: https://github.com/israel-lugo/capidup/issues/9
.. _issue #10: https://github.com/israel-lugo/capidup/issues/10
.. _issue #12: https://github.com/israel-lugo/capidup/issues/12
.. _issue #15: https://github.com/israel-lugo/capidup/issues/15
.. _issue #16: https://github.com/israel-lugo/capidup/issues/16
.. _issue #17: https://github.com/israel-lugo/capidup/issues/17
.. _issue #19: https://github.com/israel-lugo/capidup/issues/19

.. _1.1.0: https://github.com/israel-lugo/capidup/tree/v1.1.0
.. _1.0.1: https://github.com/israel-lugo/capidup/tree/v1.0.1
7 changes: 5 additions & 2 deletions README.rst
@@ -1,7 +1,7 @@
 CapiDup
 =======
 
-|license| |Codacy Badge| |Codacy Coverage| |Build Status|
+|license| |PyPi version| |PyPi pyversion| |Codacy Badge| |Codacy Coverage| |Build Status|
 
 Quickly find duplicate files in directories.
 
@@ -109,8 +109,11 @@ would check files with two different hashing algorithms. The tradeoff in speed
 would not be worthwhile for any normal use case, but the possibility could be
 there.
 
-.. |license| image:: https://img.shields.io/badge/license-GPLv3+-blue.svg
+.. |license| image:: https://img.shields.io/badge/license-GPLv3+-blue.svg?maxAge=2592000
    :target: LICENSE
+.. |PyPi version| image:: https://img.shields.io/pypi/v/capidup.svg
+   :target: https://pypi.python.org/pypi/capidup
+.. |PyPi pyversion| image:: https://img.shields.io/pypi/pyversions/capidup.svg?maxAge=86400
 .. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/15155f1c5c454678923f5fb79401d151
    :target: https://www.codacy.com/app/israel-lugo/capidup
 .. |Codacy Coverage| image:: https://api.codacy.com/project/badge/Coverage/15155f1c5c454678923f5fb79401d151
153 changes: 142 additions & 11 deletions capidup/finddups.py
@@ -41,6 +41,8 @@
 import os
 import stat
 import hashlib
+import fnmatch
+import errno
 
 from capidup import py3compat
 
@@ -80,14 +82,103 @@ def round_up_to_mult(n, mult):
     return ((n + mult - 1) // mult) * mult
 
 
+def should_be_excluded(name, exclude_patterns):
+    """Check if a name should be excluded.
+
+    Returns True if name matches at least one of the exclude patterns in
+    the exclude_patterns list.
+
+    """
+    for pattern in exclude_patterns:
+        if fnmatch.fnmatch(name, pattern):
+            return True
+    return False
+
+
+def prune_names(names, exclude_patterns):
+    """Prune subdirs or files from an index crawl.
+
+    This is used to control the search performed by os.walk() in
+    index_files_by_size().
+
+    names is the list of file or subdir names, to be pruned as per the
+    exclude_patterns list.
+
+    Returns a new (possibly pruned) names list.
+
+    """
+    return [x for x in names if not should_be_excluded(x, exclude_patterns)]
+
+
+def filter_visited(curr_dir, subdirs, already_visited, follow_dirlinks, on_error):
+    """Filter subdirs that have already been visited.
+
+    This is used to avoid loops in the search performed by os.walk() in
+    index_files_by_size.
+
+    curr_dir is the path of the current directory, as returned by os.walk().
+
+    subdirs is the list of subdirectories for the current directory, as
+    returned by os.walk().
+
+    already_visited is a set of tuples (st_dev, st_ino) of already
+    visited directories. This set will not be modified.
+
+    on_error is a function f(OSError) -> None, to be called in case of
+    error.
+
+    Returns a tuple: the new (possibly filtered) subdirs list, and a new
+    set of already visited directories, now including the subdirs.
+
+    """
+    filtered = []
+    to_visit = set()
+    _already_visited = already_visited.copy()
+
+    try:
+        # mark the current directory as visited, so we catch symlinks to it
+        # immediately instead of after one iteration of the directory loop
+        file_info = os.stat(curr_dir) if follow_dirlinks else os.lstat(curr_dir)
+        _already_visited.add((file_info.st_dev, file_info.st_ino))
+    except OSError as e:
+        on_error(e)
+
+    for subdir in subdirs:
+        full_path = os.path.join(curr_dir, subdir)
+        try:
+            file_info = os.stat(full_path) if follow_dirlinks else os.lstat(full_path)
+        except OSError as e:
+            on_error(e)
+            continue
+
+        if not follow_dirlinks and stat.S_ISLNK(file_info.st_mode):
+            # following links to dirs is disabled, ignore this one
+            continue
+
+        dev_inode = (file_info.st_dev, file_info.st_ino)
+        if dev_inode not in _already_visited:
+            filtered.append(subdir)
+            to_visit.add(dev_inode)
+        else:
+            on_error(OSError(errno.ELOOP, "directory loop detected", full_path))
+
+    return filtered, _already_visited.union(to_visit)
+
+
-def index_files_by_size(root, files_by_size):
+def index_files_by_size(root, files_by_size, exclude_dirs, exclude_files,
+                        follow_dirlinks):
     """Recursively index files under a root directory.
 
     Each regular file is added *in-place* to the files_by_size dictionary,
     according to the file size. This is a (possibly empty) dictionary of
     lists of filenames, indexed by file size.
 
+    exclude_dirs is a list of glob patterns to exclude directories.
+
+    exclude_files is a list of glob patterns to exclude files.
+
+    follow_dirlinks controls whether to follow symbolic links to
+    subdirectories while crawling.
+
-    Returns True if there were any I/O errors while listing directories.
-
     Returns a list of error messages that occurred. If empty, there were no
@@ -97,6 +188,7 @@ def index_files_by_size(root, files_by_size):
     # encapsulate the value in a list, so we can modify it by reference
     # inside the auxiliary function
     errors = []
+    already_visited = set()
 
     def _print_error(error):
         """Print a listing error to stderr.
@@ -112,13 +204,32 @@ def _print_error(error):
         errors.append(msg)
 
 
+    # XXX: The actual root may be matched by the exclude pattern. Should we
+    # prune it as well?
+
-    for curr_dir, _, filenames in os.walk(root, onerror=_print_error):
+    for curr_dir, subdirs, filenames in os.walk(root, topdown=True,
+            onerror=_print_error, followlinks=follow_dirlinks):
+
+        # modify subdirs in-place to influence os.walk
+        subdirs[:] = prune_names(subdirs, exclude_dirs)
+        filenames = prune_names(filenames, exclude_files)
+
+        # remove subdirs that have already been visited; loops can happen
+        # if there's a symlink loop and follow_dirlinks==True, or if
+        # there's a hardlink loop (which is usually a corrupted filesystem)
+        subdirs[:], already_visited = filter_visited(curr_dir, subdirs,
+                already_visited, follow_dirlinks, _print_error)
+
         for base_filename in filenames:
             full_path = os.path.join(curr_dir, base_filename)
 
-            file_info = os.lstat(full_path)
+            # avoid race condition: file can be deleted between os.walk()
+            # seeing it and us calling os.lstat()
+            try:
+                file_info = os.lstat(full_path)
+            except OSError as e:
+                _print_error(e)
+                continue
 
             # only want regular files, not symlinks
             if stat.S_ISREG(file_info.st_mode):
@@ -196,7 +307,7 @@ def find_duplicates(filenames, max_size):
     >>> dups, errs = find_duplicates(['a1', 'a2', 'b', 'c1', 'c2'], 1024)
     >>> dups
     [['a1', 'a2'], ['c1', 'c2']]
-    >>> errors
+    >>> errs
     []
 
     Note that ``b`` is not included in the results, as it has no duplicates.
@@ -241,9 +352,20 @@ def find_duplicates(filenames, max_size):
 
 
 
-def find_duplicates_in_dirs(directories):
+def find_duplicates_in_dirs(directories, exclude_dirs=None, exclude_files=None,
+                            follow_dirlinks=False):
     """Recursively scan a list of directories, looking for duplicate files.
 
+    `exclude_dirs`, if provided, should be a list of glob patterns.
+    Subdirectories whose names match these patterns are excluded from the
+    scan.
+
+    `exclude_files`, if provided, should be a list of glob patterns. Files
+    whose names match these patterns are excluded from the scan.
+
+    ``follow_dirlinks`` controls whether to follow symbolic links to
+    subdirectories while crawling.
+
     Returns a 2-tuple of two values: ``(duplicate_groups, errors)``.
 
     `duplicate_groups` is a (possibly empty) list of lists: the names of files
@@ -252,22 +374,31 @@ def find_duplicates_in_dirs(directories):
     `errors` is a list of error messages that occurred. If empty, there were no
     errors.
 
-    For example, assuming ``./a1`` and ``dir1/a2`` are identical, ``dir1/c1`` and
-    ``dir2/c2`` are identical, and ``dir2/b`` is different from all others:
+    For example, assuming ``./a1`` and ``/dir1/a2`` are identical,
+    ``/dir1/c1`` and ``/dir2/c2`` are identical, ``/dir2/b`` is different
+    from all others, that any subdirectories called ``tmp`` should not
+    be scanned, and that files ending in ``.bak`` should be ignored:
 
-    >>> dups, errs = find_duplicates(['.', 'dir1', 'dir2'])
+    >>> dups, errs = find_duplicates_in_dirs(['.', '/dir1', '/dir2'], ['tmp'], ['*.bak'])
     >>> dups
-    [['./a1', 'dir1/a2'], ['dir1/c1', 'dir2/c2']]
-    >>> errors
+    [['./a1', '/dir1/a2'], ['/dir1/c1', '/dir2/c2']]
+    >>> errs
     []
 
     """
+    if exclude_dirs is None:
+        exclude_dirs = []
+
+    if exclude_files is None:
+        exclude_files = []
+
     errors_in_total = []
     files_by_size = {}
 
     # First, group all files by size
     for directory in directories:
-        sub_errors = index_files_by_size(directory, files_by_size)
+        sub_errors = index_files_by_size(directory, files_by_size, exclude_dirs,
+                                         exclude_files, follow_dirlinks)
         errors_in_total += sub_errors
 
     all_duplicates = []