confusing default --line-max-size #10

Closed
freeroute opened this issue Oct 31, 2019 · 5 comments
Labels
needs documentation this issue happened because some documentation or FAQ is missing/confusing

Comments

@freeroute

I just tested duplicut with a 2.5 GB dictionary.

file dictionary_private.dic : data
time sort -u dictionary_private.dic >dict_sort_uniq.txt
real 5m40,168s
user 13m9,512s
sys 0m7,682s
time duplicut dictionary_private.dic -o dict_dedupe.txt

real 0m47,435s
user 0m32,963s

duplicut is much faster than the "sort -u" command,
but the results are not the same. Counting the lines of the new wordlists:

wc -l dict_*
171193011 dict_dedupe.txt
205241662 dict_sort_uniq.txt

number of lines of the original file:
wc -l dictionary_private.dic
206282806 dictionary_private.dic
What can cause this discrepancy?

@nil0x42
Owner

nil0x42 commented Nov 6, 2019

Hello !
First and foremost, duplicut is not meant to be faster than sort -u (but I'm happy to see it is in some cases).

What sort -u does is sort the file alphabetically, then iterate through the lines to check whether line N equals line N+1, deleting the duplicate if so.
What duplicut does is actually a lot more complicated, as it is able to remove duplicates without sorting. So if you don't need to keep the original order, it might be better to use sort or another duplicate-removal tool.
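
To make the difference concrete, here is a minimal Python sketch of the two approaches. It is only an illustration of the resulting output, not how duplicut or sort are implemented; the function names are hypothetical, and Python's `sorted` uses code-point order whereas `sort -u` follows the current locale.

```python
def sort_unique(lines):
    """Roughly what `sort -u` does: sort, then drop adjacent equal lines."""
    result = []
    for line in sorted(lines):
        # After sorting, duplicates are neighbours, so comparing with the
        # previously kept line is enough.
        if not result or result[-1] != line:
            result.append(line)
    return result


def dedupe_keep_order(lines):
    """Order-preserving dedupe: keep only the first occurrence of each line."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


words = ["zebra", "apple", "zebra", "mango", "apple"]
print(sort_unique(words))        # ['apple', 'mango', 'zebra']
print(dedupe_keep_order(words))  # ['zebra', 'apple', 'mango']
```

Both results contain the same set of lines; only the order differs. So a line-count mismatch between the two tools has to come from additional filtering, not from the deduplication itself.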

But anyway, I suppose you used sort -u just to compare outputs and check whether duplicut actually works.
So I invite you to run duplicut --help and take a look at the options.

For example, there is --line-max-size, which defaults to 14, meaning that lines longer than 14 chars are removed, even if unique.

Also, empty lines are automatically removed by duplicut.

These additional behaviors exist because duplicut is meant to aggregate password wordlists without losing the order and without keeping duplicates. In a password wordlist context, I rarely want to keep lines longer than 14 chars, as they might be a garbage line, a password too long to be worth guessing, or a parsing error from the tool that generated the line. Empty lines are also deleted, being obviously useless in a password wordlist.
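
The filtering described above can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the project's remove-duplicates.py and not duplicut's actual code: the name `filter_wordlist` is hypothetical, length is counted in bytes (the thread only says "chars"), and the non-printable-character filter mentioned later in this thread is not modeled. It also tallies why lines were dropped, which helps explain a line-count gap like the one reported.

```python
from collections import Counter


def filter_wordlist(lines, line_max_size=14):
    """Return (kept_lines, drop_reasons) for an iterable of bytes lines.

    Lines are expected without their trailing newline.
    """
    seen = set()
    kept = []
    dropped = Counter()
    for line in lines:
        if not line:
            dropped["empty"] += 1          # empty lines are always removed
        elif len(line) > line_max_size:
            dropped["too_long"] += 1       # removed even if unique
        elif line in seen:
            dropped["duplicate"] += 1      # only the first occurrence is kept
        else:
            seen.add(line)
            kept.append(line)
    return kept, dropped


sample = [b"password", b"", b"password", b"correcthorsebatterystaple", b"letmein"]
kept, dropped = filter_wordlist(sample)
print(kept)           # [b'password', b'letmein']
print(dict(dropped))  # {'empty': 1, 'duplicate': 1, 'too_long': 1}
```

To run it on a real file, one could pass `(l.rstrip(b"\n") for l in open(path, "rb"))` as `lines`. Note that the `seen` set keeps every unique line in memory, so this is only practical for small test files.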


Anyway, if you want to test duplicut, I recommend taking a look at these files from my test suite:
https://github.com/nil0x42/duplicut/blob/master/test/scripts/remove-duplicates.py
https://github.com/nil0x42/duplicut/blob/master/test/tests/nonreg.sh

remove-duplicates.py is a small Python script meant to behave like duplicut (it's just a million times slower :)), so you can read it to see what differs from sort, and you can compare its output with duplicut's.

@nil0x42
Owner

nil0x42 commented Nov 6, 2019

Anyway, if I answered your doubts and the issue is resolved, feel free to close it. Otherwise, I'll be happy to debug with you!

nil0x42 added the "needs documentation" label Sep 4, 2020
@nil0x42
Owner

nil0x42 commented Sep 4, 2020

Adding the needs documentation label, because this issue was probably caused by the --line-max-size option being unclearly documented.
Possible fixes:
Add a message when duplicut runs, saying exactly how the wordlist is going to be filtered, something like:

removing lines larger than `N` chars, containing non-printable chars, or duplicated.

@nil0x42
Owner

nil0x42 commented Sep 9, 2020

Another interesting 'user warning' would be to inform the user if no \n has been found in the file's first 4096 bytes (because the file might be an old-style macOS wordlist with \r newline separators).
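
A check like this could look roughly as follows in Python. This is only a sketch of the idea, not something duplicut is confirmed to implement: the names `newline_warning` and `PROBE_SIZE` and the warning texts are hypothetical.

```python
PROBE_SIZE = 4096  # number of leading bytes to inspect, as suggested above


def newline_warning(head: bytes):
    """Return a warning string if `head` contains no LF byte, else None."""
    if not head or b"\n" in head:
        return None  # empty input, or at least one normal line ending
    if b"\r" in head:
        # CR without any LF: typical of old-style macOS text files.
        return "no LF found in first bytes: possible CR-separated file"
    return "no LF found in first bytes: file may be one very long line"


def check_file(path):
    """Read the first PROBE_SIZE bytes of `path` and run the check."""
    with open(path, "rb") as handle:
        return newline_warning(handle.read(PROBE_SIZE))


print(newline_warning(b"alpha\nbeta\n"))  # None
print(newline_warning(b"alpha\rbeta\r"))
```

A short single-line file without a trailing newline would also trigger the warning, which seems acceptable for a message that is only advisory.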

@nil0x42
Owner

nil0x42 commented Sep 27, 2020

@freeroute, can you please confirm whether the problem was due to the --line-max-size option?

nil0x42 changed the title from "Different output" to "confusing default --line-max-size" Sep 29, 2020