-
-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
expand germanic street abbreviations #68
Conversation
GET /pelias/_search?search_type=count
{
"aggs": {
"field_aggs": {
"terms": {
"field": "parent.country_a"
},
"aggs": {
"germanic_ending": {
"filter": {
"regexp": {
"address.street": "[^ ]+(str)"
}
}
}
}
}
}
} {
"took": 10014,
"timed_out": false,
"_shards": {
"total": 12,
"successful": 12,
"failed": 0
},
"hits": {
"total": 279912234,
"max_score": 0,
"hits": []
},
"aggregations": {
"field_aggs": {
"doc_count_error_upper_bound": 39547,
"sum_other_doc_count": 57718434,
"buckets": [
{
"key": "fra",
"doc_count": 25905674,
"germanic_ending": {
"doc_count": 5
}
},
{
"key": "nld",
"doc_count": 17577224,
"germanic_ending": {
"doc_count": 97
}
},
{
"key": "deu",
"doc_count": 14241847,
"germanic_ending": {
"doc_count": 58369
}
},
{
"key": "cze",
"doc_count": 5972478,
"germanic_ending": {
"doc_count": 1
}
}
]
}
}
} The French ones are:
The Czech one is (I think this is actually DEU but incorrectly attributed as CZE, it's ~1km from the border)
The Dutch actually expand differently:
|
While I have no objections to corrections to data that the OA team is wary of taking responsibility for, I would like to see such actions take place in a more targeted way. That is, currently data cleanup is happened as soon as the record is read before additional context of admin-lookup is added. Since this change is specifically for German street names, it should probably happen after admin-lookup so that modifications can benefit from knowing the country. |
right @trescube agree, I'll move the stream after the admin lookup and only target German addresses. |
This looks like a really good change to have. We should be careful though, since it will be running across LOTS of data. First, I like stephen's suggestion of limiting it to just a few countries. The aggregation above has a line Second, we have to make sure that even within those countries, it doesn't do something unexpected. When I initially added the street name cleanup method, I hacked my importer to print out a line each time it made a change, with the old value, and the new one, and then skip the rest of the import pipeline. I ran the resulting output file through |
ab92249
to
4fd7c0d
Compare
I had some time to fool around with this, and made some quick modifications to the import pipeline so that it prints out these changes so we can look at them. There were only 2277 unique changes made, they're attached so we can look at them, but we probably also want to look at what country the changes were made in, since I suspect some of them are acceptable changes in, say, Germany, but not the UK. |
The branch with those modifications is https://github.com/pelias/openaddresses/tree/expand_german_test |
👍 |
// expand '-str.' to '-strasse' | ||
// note: '-straße' is only used in Germany, other countries like | ||
// switzerland use 'strasse'. | ||
function expandGermanicStreetSuffixes(token) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this functionality still required? it seems to be a duplicate of the functions in lib/streams/germanAbbreviationStream.js
So, to be clear, we dont want something like Foo Str. to be expanded to Foo Strasse? It seems the only time that you'd get fooStrasse is with the initial string fooStr(.) Is fooStr a possible way of writing this? It would seem that you'd either add a space or keep the s lower case. If it really is used then changing it is no big deal. Point number 2. The way the regex for moldova should work is that its anchored at the beginning (with ^) and matches 0 or more spaces after that followed by either S or s followed by t and r and maybe a . and ending with a space. So this regex should only catch |
Yes, you are correct in saying we would like I was referring to this situation, which should return "fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse')
"fooStrasse" It's best to keep the words compound or separate depending on the source, so:
I think I didn't read the Moldava regex properly the first time, I think your version looks good
|
So would the best way to fix that just be to change |
@@ -1,20 +1,45 @@ | |||
var _ = require('lodash'); | |||
var through = require('through2'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to keep things clear it would be great to rename this file to germanicAbbreviationStream
. A small change but an important one for clarity.
input_stream.pipe(testedStream).pipe(destination_stream); | ||
} | ||
|
||
tape( 'germanStream expands tokens ending in "-str." to "-strasse" (mostly DEU)', function(test) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
german -> germanic
64a1363
to
6359cc0
Compare
6359cc0
to
cfa131a
Compare
LGTM |
var through = require('through2'); | ||
|
||
// match strings ending in one of: ['str.', 'Str.', 'str', 'Str'] | ||
var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifying (s|S)
and the i
flag is redundant
otherwise, |
well the (s|S) captures the case of the first letter, which i want, whereas the i flag doesnt |
Sure it does:
|
wow im dumb, i didnt even think about capturing just the s. i will change it now. |
|
eg. 'Lindenstr' -> 'Lindenstrasse'
closes pelias/pelias#279