Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expand germanic street abbreviations #68

Merged
merged 7 commits into from
Aug 19, 2016

Conversation

missinglink
Copy link
Member

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse'.

eg. 'Lindenstr' -> 'Lindenstrasse'

closes pelias/pelias#279

@missinglink
Copy link
Member Author

GET /pelias/_search?search_type=count
{
   "aggs": {
      "field_aggs": {
         "terms": {
            "field": "parent.country_a"
         },
         "aggs": {
            "germanic_ending": {
               "filter": {
                  "regexp": {
                     "address.street": "[^ ]+(str)"
                  }
               }
            }
         }
      }
   }
}
{
   "took": 10014,
   "timed_out": false,
   "_shards": {
      "total": 12,
      "successful": 12,
      "failed": 0
   },
   "hits": {
      "total": 279912234,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "field_aggs": {
         "doc_count_error_upper_bound": 39547,
         "sum_other_doc_count": 57718434,
         "buckets": [
            {
               "key": "fra",
               "doc_count": 25905674,
               "germanic_ending": {
                  "doc_count": 5
               }
            },
            {
               "key": "nld",
               "doc_count": 17577224,
               "germanic_ending": {
                  "doc_count": 97
               }
            },
            {
               "key": "deu",
               "doc_count": 14241847,
               "germanic_ending": {
                  "doc_count": 58369
               }
            },
            {
               "key": "cze",
               "doc_count": 5972478,
               "germanic_ending": {
                  "doc_count": 1
               }
            }
         ]
      }
   }
}

The French ones are:

  • 5172 kermestr
  • 5248 kermestr
  • 289 kermestr
  • 82 kerglastr
  • 5025 kerglastr

The Czech one is (I think this is actually DEU but incorrectly attributed as CZE, it's ~1km from the border)

  • 2 Jägershoferstr

The Dutch actually expand differently:

  • Borgercompagniesterstr -> Borgercompagniesterstraat

@trescube
Copy link
Contributor

While I have no objections to corrections to data that the OA team is wary of taking responsibility for, I would like to see such actions take place in a more targeted way. That is, currently data cleanup is happened as soon as the record is read before additional context of admin-lookup is added. Since this change is specifically for German street names, it should probably happen after admin-lookup so that modifications can benefit from knowing the country.

@missinglink
Copy link
Member Author

right @trescube agree, I'll move the stream after the admin lookup and only target German addresses.

@orangejulius
Copy link
Member

This looks like a really good change to have.

We should be careful though, since it will be running across LOTS of data.

First, I like stephen's suggestion of limiting it to just a few countries. The aggregation above has a line "sum_other_doc_count": 57718434, which says there are 57M matching documents not in those 4 countries listed.

Second, we have to make sure that even within those countries, it doesn't do something unexpected. When I initially added the street name cleanup method, I hacked my importer to print out a line each time it made a change, with the old value, and the new one, and then skip the rest of the import pipeline. I ran the resulting output file through sort | uniq -c to see how many unique changes there were (there were only a couple hundred), and looked at each one to make sure it was ok. Let's do the same thing here.

@orangejulius
Copy link
Member

I had some time to fool around with this, and made some quick modifications to the import pipeline so that it prints out these changes so we can look at them. There were only 2277 unique changes made, they're attached so we can look at them, but we probably also want to look at what country the changes were made in, since I suspect some of them are acceptable changes in, say, Germany, but not the UK.
changes.txt

@orangejulius
Copy link
Member

The branch with those modifications is https://github.com/pelias/openaddresses/tree/expand_german_test

@trescube
Copy link
Contributor

trescube commented Jul 7, 2016

👍

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse'.
function expandGermanicStreetSuffixes(token) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this functionality still required? it seems to be a duplicate of the functions in lib/streams/germanAbbreviationStream.js

@avulfson17
Copy link
Contributor

So, to be clear, we dont want something like Foo Str. to be expanded to Foo Strasse? It seems the only time that you'd get fooStrasse is with the initial string fooStr(.)

Is fooStr a possible way of writing this? It would seem that you'd either add a space or keep the s lower case. If it really is used then changing it is no big deal.

Point number 2. The way the regex for moldova should work is that its anchored at the beginning (with ^) and matches 0 or more spaces after that followed by either S or s followed by t and r and maybe a . and ending with a space. So this regex should only catch str, str., Str., and Str at the beginning of the street name. If this gets used then the space is necessary because the code checks for a space in order to match it

@missinglink
Copy link
Member Author

Yes, you are correct in saying we would like Foo Str. to be Foo Strasse, I hadn't thought about that case, and in that case it makes sense to capitalize the street token.

I was referring to this situation, which should return foostrasse (when it's a compound word)

"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse')
"fooStrasse"

It's best to keep the words compound or separate depending on the source, so:

foo str. -> foo strasse
foostr. -> foostrasse

I think I didn't read the Moldava regex properly the first time, I think your version looks good

"Str. foo".replace(/^([\s]*)(s|S)tr\.?\s/i,'$2trada ')
"Strada foo"

@avulfson17
Copy link
Contributor

So would the best way to fix that just be to change
"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse') to
"fooStr".replace(/([^\s]+)(s|\sS)tr\.?$/i,'$1$2trasse') ?

@@ -1,20 +1,45 @@
var _ = require('lodash');
var through = require('through2');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep things clear it would be great to rename this file to germanicAbbreviationStream. A small change but an important one for clarity.

input_stream.pipe(testedStream).pipe(destination_stream);
}

tape( 'germanStream expands tokens ending in "-str." to "-strasse" (mostly DEU)', function(test) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

german -> germanic

@avulfson17 avulfson17 force-pushed the expand_german_street_abbreviations branch from 64a1363 to 6359cc0 Compare August 11, 2016 14:43
@avulfson17 avulfson17 force-pushed the expand_german_street_abbreviations branch from 6359cc0 to cfa131a Compare August 17, 2016 16:50
@orangejulius
Copy link
Member

LGTM

var through = require('through2');

// match strings ending in one of: ['str.', 'Str.', 'str', 'Str']
var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying (s|S) and the i flag is redundant

@trescube
Copy link
Contributor

otherwise, :shipit:

@avulfson17
Copy link
Contributor

well the (s|S) captures the case of the first letter, which i want, whereas the i flag doesnt

@trescube
Copy link
Contributor

Sure it does:

> 'abc'.replace(/(a)/i, '$1');
'abc'
> 'Abc'.replace(/(a)/i, '$1');
'Abc'

@avulfson17
Copy link
Contributor

wow im dumb, i didnt even think about capturing just the s. i will change it now.

@trescube
Copy link
Contributor

:shipit:!

@orangejulius orangejulius merged commit 9d4455a into master Aug 19, 2016
@orangejulius orangejulius deleted the expand_german_street_abbreviations branch August 19, 2016 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

openaddresses German street names imported in the contracted form.
4 participants