expand germanic street abbreviations #68

missinglink · 2016-03-16T17:53:33Z

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse'.

eg. 'Lindenstr' -> 'Lindenstrasse'

closes pelias/pelias#279

missinglink · 2016-03-16T18:32:53Z

GET /pelias/_search?search_type=count
{
   "aggs": {
      "field_aggs": {
         "terms": {
            "field": "parent.country_a"
         },
         "aggs": {
            "germanic_ending": {
               "filter": {
                  "regexp": {
                     "address.street": "[^ ]+(str)"
                  }
               }
            }
         }
      }
   }
}

{
   "took": 10014,
   "timed_out": false,
   "_shards": {
      "total": 12,
      "successful": 12,
      "failed": 0
   },
   "hits": {
      "total": 279912234,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "field_aggs": {
         "doc_count_error_upper_bound": 39547,
         "sum_other_doc_count": 57718434,
         "buckets": [
            {
               "key": "fra",
               "doc_count": 25905674,
               "germanic_ending": {
                  "doc_count": 5
               }
            },
            {
               "key": "nld",
               "doc_count": 17577224,
               "germanic_ending": {
                  "doc_count": 97
               }
            },
            {
               "key": "deu",
               "doc_count": 14241847,
               "germanic_ending": {
                  "doc_count": 58369
               }
            },
            {
               "key": "cze",
               "doc_count": 5972478,
               "germanic_ending": {
                  "doc_count": 1
               }
            }
         ]
      }
   }
}

The French ones are:

5172 kermestr
5248 kermestr
289 kermestr
82 kerglastr
5025 kerglastr

The Czech one is (I think this is actually DEU but incorrectly attributed as CZE, it's ~1km from the border)

2 Jägershoferstr

The Dutch actually expand differently:

Borgercompagniesterstr -> Borgercompagniesterstraat

trescube · 2016-03-16T18:36:50Z

While I have no objections to corrections to data that the OA team is wary of taking responsibility for, I would like to see such actions take place in a more targeted way. Currently, data cleanup happens as soon as the record is read, before the additional context from admin-lookup is added. Since this change is specifically for German street names, it should probably happen after admin-lookup so that modifications can benefit from knowing the country.

missinglink · 2016-03-16T18:39:54Z

right @trescube agree, I'll move the stream after the admin lookup and only target German addresses.
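A minimal sketch of what a country-aware version could look like once the stream sits after admin lookup. Everything here is an assumption for illustration: the record shape (`{ street, country }`), the list of country codes, and the helper names are hypothetical, and Node's built-in `stream.Transform` stands in for `through2`.

```javascript
const { Transform } = require('stream');

// Assumption: alpha-3 codes of countries where '-str.' should become '-strasse'.
const GERMANIC_COUNTRIES = new Set(['DEU', 'AUT', 'CHE']);

// Matches a compound token ending in 'str' or 'str.', in either case.
const STREET_ABBR_SUFFIX = /([^\s]+)(s)tr\.?$/i;

// Hypothetical record shape: { street: string, country: string }.
// Returns the record untouched unless the country is in the allow-list.
function expandIfGermanic(record) {
  const country = String(record.country || '').toUpperCase();
  if (!GERMANIC_COUNTRIES.has(country) || typeof record.street !== 'string') {
    return record;
  }
  return {
    ...record,
    street: record.street.replace(STREET_ABBR_SUFFIX, '$1$2trasse')
  };
}

// Object-mode stream wrapper, meant to be piped after the admin lookup step.
function createGermanicAbbreviationStream() {
  return new Transform({
    objectMode: true,
    transform(record, _encoding, callback) {
      callback(null, expandIfGermanic(record));
    }
  });
}
```

With this shape, a French record such as `kermestr` passes through unchanged, while `Lindenstr` in a `deu` record becomes `Lindenstrasse`.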

orangejulius · 2016-03-16T20:12:41Z

This looks like a really good change to have.

We should be careful though, since it will be running across LOTS of data.

First, I like Stephen's suggestion of limiting it to just a few countries. The aggregation above has a line "sum_other_doc_count": 57718434, which says that about 57.7M documents belong to countries outside the returned buckets.

Second, we have to make sure that even within those countries, it doesn't do something unexpected. When I initially added the street name cleanup method, I hacked my importer to print out a line each time it made a change, with the old value, and the new one, and then skip the rest of the import pipeline. I ran the resulting output file through sort | uniq -c to see how many unique changes there were (there were only a couple hundred), and looked at each one to make sure it was ok. Let's do the same thing here.
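The audit step described above could also be done inside the importer itself. This is only a sketch: `tallyChanges` is a hypothetical helper, and it assumes the pipeline can hand over `[oldValue, newValue]` pairs. It mirrors what `sort | uniq -c` would report.

```javascript
// Hypothetical helper: count how often each distinct "old -> new" rewrite occurs.
// `pairs` is assumed to be an iterable of [before, after] street names.
function tallyChanges(pairs) {
  const counts = new Map();
  for (const [before, after] of pairs) {
    if (before === after) continue; // unchanged records are not interesting
    const key = `${before} -> ${after}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most frequent first; ties broken alphabetically so the output is stable.
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}

const report = tallyChanges([
  ['Lindenstr', 'Lindenstrasse'],
  ['Lindenstr', 'Lindenstrasse'],
  ['Main St', 'Main St'],
  ['Hauptstr.', 'Hauptstrasse']
]);
for (const [change, count] of report) {
  console.log(String(count).padStart(7), change);
}
```

Adding the country code to the key would split the counts per country, which makes it easier to spot rewrites that are fine in one place but not in another.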

orangejulius · 2016-07-07T13:22:27Z

I had some time to fool around with this and made some quick modifications to the import pipeline so that it prints out these changes. There were only 2277 unique changes, which are attached so we can look at them. We probably also want to look at which country each change was made in, since I suspect some of them are acceptable in, say, Germany, but not the UK.
changes.txt

orangejulius · 2016-07-07T13:22:56Z

The branch with those modifications is https://github.com/pelias/openaddresses/tree/expand_german_test

trescube · 2016-07-07T13:35:43Z

👍

missinglink · 2016-08-03T14:19:58Z

lib/cleanup.js

+// expand '-str.' to '-strasse'
+// note: '-straße' is only used in Germany, other countries like
+// switzerland use 'strasse'.
+function expandGermanicStreetSuffixes(token) {


Is this functionality still required? It seems to be a duplicate of the functions in lib/streams/germanAbbreviationStream.js

avulfson17 · 2016-08-03T15:55:48Z

So, to be clear, we don't want something like Foo Str. to be expanded to Foo Strasse? It seems the only time you'd get fooStrasse is with the initial string fooStr(.)

Is fooStr a possible way of writing this? It would seem that you'd either add a space or keep the s lower case. If it really is used then changing it is no big deal.

Point number 2. The regex for Moldova should work like this: it's anchored at the beginning (with ^), matches 0 or more spaces after that, followed by either S or s, followed by t and r and maybe a ., and ending with a space. So this regex should only catch str, str., Str., and Str at the beginning of the street name. If this gets used then the space is necessary, because the code checks for a space in order to match it.
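To make the description concrete, here is a small sketch of a prefix pattern with those properties. The exact pattern and the helper name are assumptions for illustration, not necessarily what the branch contains.

```javascript
// Assumed form of the pattern described above: anchored at the start, optional
// leading spaces, 's' or 'S' (captured to keep its case), 'tr', an optional '.',
// and a mandatory trailing whitespace character.
const MOLDOVAN_STREET_PREFIX = /^\s*(s)tr\.?\s/i;

// Hypothetical helper name. Leading whitespace is dropped by the replacement.
function expandStradaPrefix(street) {
  return street.replace(MOLDOVAN_STREET_PREFIX, '$1trada ');
}

const samples = [
  'str foo',      // matches
  'str. foo',     // matches
  'Str. foo',     // matches, capital S preserved
  'Strada foo',   // already expanded, no match
  'foo Str. bar', // not at the start, no match
  'Str.foo'       // no space after the abbreviation, no match
];
for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', JSON.stringify(expandStradaPrefix(sample)));
}
```

The mandatory trailing whitespace is what keeps an already expanded `Strada foo` from being rewritten a second time.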

missinglink · 2016-08-03T16:41:56Z

Yes, you are correct in saying we would like Foo Str. to be Foo Strasse, I hadn't thought about that case, and in that case it makes sense to capitalize the street token.

I was referring to this situation, which should return foostrasse (when it's a compound word)

"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse')
"fooStrasse"

It's best to keep the words compound or separate depending on the source, so:

foo str. -> foo strasse
foostr. -> foostrasse
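Note that the pattern quoted above requires at least one non-space character directly before `str`, so it leaves `foo str.` untouched. A minimal sketch of one way to cover both the compound and the separate form follows. The simplified regex and the function name are assumptions for illustration, not the code in this PR.

```javascript
// Assumption: dropping the leading ([^\s]+) group lets the suffix match whether
// or not a space precedes it. The captured 's' keeps its original case.
const STREET_ABBR_AT_END = /(s)tr\.?$/i;

// Hypothetical helper name.
function expandStrasse(street) {
  return street.replace(STREET_ABBR_AT_END, '$1trasse');
}

console.log(expandStrasse('foo str.'));  // separate word stays separate
console.log(expandStrasse('foostr.'));   // compound word stays compound
console.log(expandStrasse('Foo Str'));   // capital S is preserved
console.log(expandStrasse('Main St'));   // no 'str' suffix, left unchanged
```

Because nothing is inserted or removed before the suffix, the spacing of the source is kept as it was.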

I don't think I read the Moldova regex properly the first time. Your version looks good:

"Str. foo".replace(/^([\s]*)(s|S)tr\.?\s/i,'$2trada ')
"Strada foo"

avulfson17 · 2016-08-03T16:45:38Z

So would the best way to fix that just be to change
"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse') to
"fooStr".replace(/([^\s]+)(s|\sS)tr\.?$/i,'$1$2trasse') ?

orangejulius · 2016-08-10T17:05:53Z

lib/streams/germanAbbreviationStream.js

@@ -1,20 +1,45 @@
-var _ = require('lodash');
 var through = require('through2');


Just to keep things clear it would be great to rename this file to germanicAbbreviationStream. A small change but an important one for clarity.

orangejulius · 2016-08-10T17:15:50Z

test/streams/germanicAbbreviationStream.js

+  input_stream.pipe(testedStream).pipe(destination_stream);
+}
+
+tape( 'germanStream expands tokens ending in "-str." to "-strasse" (mostly DEU)', function(test) {


german -> germanic

orangejulius · 2016-08-17T16:52:15Z

LGTM

trescube · 2016-08-17T19:45:01Z

lib/streams/germanicAbbreviationStream.js

+var through = require('through2');
+
+// match strings ending in one of: ['str.', 'Str.', 'str', 'Str']
+var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i;


Specifying (s|S) and the i flag is redundant

trescube · 2016-08-17T19:51:56Z

otherwise,

avulfson17 · 2016-08-17T20:05:02Z

Well, the (s|S) captures the case of the first letter, which I want, whereas the i flag doesn't.

trescube · 2016-08-17T20:17:37Z

Sure it does:

> 'abc'.replace(/(a)/i, '$1');
'abc'
> 'Abc'.replace(/(a)/i, '$1');
'Abc'
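The same check applied to the suffix pattern under review, as a small sketch. The constant and function names are hypothetical; the pattern is the one quoted above with `(s|S)` reduced to `(s)`.

```javascript
// With the i flag, (s) matches both 's' and 'S', and $2 holds whichever
// character was actually present in the input.
const SUFFIX_WITH_SINGLE_GROUP = /([^\s]+)(s)tr\.?$/i;

// Hypothetical helper name.
function expandSuffix(street) {
  return street.replace(SUFFIX_WITH_SINGLE_GROUP, '$1$2trasse');
}

console.log(expandSuffix('foostr'));   // lowercase s kept
console.log(expandSuffix('FooStr.'));  // uppercase S kept
```

Only the captured letter keeps its case. The `trasse` part of the replacement is a literal, so it is always lowercase.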

avulfson17 · 2016-08-17T20:19:23Z

Wow, I'm dumb, I didn't even think about capturing just the s. I will change it now.

trescube · 2016-08-17T20:38:07Z

!

missinglink added in progress in review and removed in progress labels Mar 16, 2016

missinglink self-assigned this Mar 16, 2016

missinglink mentioned this pull request Mar 17, 2016

Refactor autocomplete analysis pelias/schema#109

Closed

orangejulius added in progress and removed in review labels Apr 21, 2016

missinglink mentioned this pull request Apr 29, 2016

autocomplete milestone pelias/schema#127

Merged

orangejulius force-pushed the expand_german_street_abbreviations branch from ab92249 to 4fd7c0d Compare July 2, 2016 13:44

orangejulius assigned avulfson17 and unassigned missinglink Jul 7, 2016

missinglink mentioned this pull request Jul 26, 2016

openaddresses German street names imported in the contracted form. pelias/pelias#279

Closed

missinglink reviewed Aug 3, 2016

orangejulius reviewed Aug 10, 2016

avulfson17 force-pushed the expand_german_street_abbreviations branch from 64a1363 to 6359cc0 Compare August 11, 2016 14:43

missinglink and others added 6 commits August 17, 2016 12:46

expand germanic street abbreviations

63b1197

Expand str abbreviations

c5ac5f9

Added tests and played around with the regex

0101c9c

Fixing country codes and deleting extraneous code

2b7e9d3

Refactoring germanAbbreviation stream

48ddca6

Renamed germanAbbreviation file

cfa131a

avulfson17 force-pushed the expand_german_street_abbreviations branch from 6359cc0 to cfa131a Compare August 17, 2016 16:50

trescube reviewed Aug 17, 2016

Remove redundancy from regular expression

cada975

orangejulius merged commit 9d4455a into master Aug 19, 2016

orangejulius removed the in progress label Aug 19, 2016

orangejulius deleted the expand_german_street_abbreviations branch August 19, 2016 18:07

orangejulius mentioned this pull request Mar 3, 2022

remove unused germanicAbbreviationStream #501

Merged
