Output URLs are not correctly encoded #142

Open
@maaaaz

Description

Hello there,

Even the latest version of ODD (v3.1.0.1) does not properly encode URLs in the output file.

Let me detail the case:

  1. First, let's run ODD against a website (found at random on the internet) whose paths contain special characters:
$ ./OpenDirectoryDownloader -u "https://gregoirelorieux.net/paysagescomposes/villes/Melle/" --output-file test
[...]
Finshed indexing
[...]
Saving URL list to file..
Saved URL list to file: /tmp/test.txt
  2. Then let's look at the first results in the output file:
$ head test.txt
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
[...]
  3. If we try to download the first file with wget (or with other download managers), it fails because the URL contains unencoded characters: "#" and spaces.
$ wget -v "https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
--2024-10-29 23:22:12--  https://gregoirelorieux.net/paysagescomposes/villes/Melle/
Resolving gregoirelorieux.net (gregoirelorieux.net)... 213.186.33.87
Connecting to gregoirelorieux.net (gregoirelorieux.net)|213.186.33.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 844 [text/html]
Saving to: ‘index.html’

index.html                              100%[===============================================================================>]     844  --.-KB/s    in 0s

2024-10-29 23:22:13 (550 MB/s) - ‘index.html’ saved [844/844]

Here, the downloaded file:

  • is not the requested one: https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
  • but comes from this automatically truncated link: https://gregoirelorieux.net/paysagescomposes/villes/Melle/
    wget treats everything after the first "#" as a URL fragment, which is never sent to the server, so only the directory URL is requested.
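
To illustrate why the request stops at the directory, here is a minimal sketch using Python's standard URL parser. It is not wget's own code; it only shows how generic URL syntax splits an unencoded "#" into a fragment, which clients do not send to the server.

```python
from urllib.parse import urlsplit

# The exact line taken from the ODD output file (unencoded "#" and spaces).
raw = (
    "https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
    "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
)

parts = urlsplit(raw)

# Everything before the first "#" is the path that gets requested.
request_path = parts.path
# Everything after the first "#" is the fragment, kept on the client side.
fragment = parts.fragment

print("path requested:", request_path)
print("fragment (not sent):", fragment)
```

The path ends at `/paysagescomposes/villes/Melle/`, which matches the URL shown in the wget log above.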

The correctly encoded link in the ODD output file should be:
https://gregoirelorieux.net/paysagescomposes/villes/Melle/%233%2021%20jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Instead of:
https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif

Can you fix it?

A function like JavaScript's encodeURIComponent, applied to each path segment rather than to the whole URL, should help.
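
As a rough sketch of the idea (in Python, not ODD's actual code), the path can be percent-encoded one segment at a time so that the "/" separators survive. The helper name `encode_listing_url` is hypothetical. It assumes the input is a raw, not yet encoded URL with no query string, where any "#" or "?" belongs to a file or directory name. Input that is already encoded would be encoded twice.

```python
from urllib.parse import quote


def encode_listing_url(raw: str) -> str:
    """Percent-encode each path segment of a raw directory-listing URL.

    Hypothetical helper: assumes `raw` is unencoded and has no query string.
    """
    scheme, sep, rest = raw.partition("://")
    if not sep:
        raise ValueError(f"not an absolute URL: {raw!r}")

    # Split off the host manually; a standard URL parser would treat the
    # literal "#" as the start of a fragment and drop it from the path.
    authority, slash, path = rest.partition("/")

    # safe="" makes quote() encode every reserved character ("#", " ", "?"),
    # while the "/" separators are restored by the join.
    encoded = "/".join(quote(segment, safe="") for segment in path.split("/"))
    return f"{scheme}://{authority}{slash}{encoded}"


raw = (
    "https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
    "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
)
print(encode_listing_url(raw))
```

With the URL from this report, the result is the encoded link given above, which wget can fetch without truncating at "#".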

Cheers!
