[04:56:38] XioNoX: cool :) [06:07:19] <_joe_> nice indeed [12:43:01] <_joe_> I don't see how this url can be valid [12:43:07] <_joe_> hi mateusbs17 [12:43:13] <_joe_> is this url valid for maps? [12:43:15] hey _joe_ [12:43:22] <_joe_> /img/osm-intl,12,50.0863889,14.4063889,240x150.png [12:43:27] it does return reasonable content [12:43:49] <_joe_> these are the urls that are timing out, mostly [12:43:55] Yes, it is valid [12:44:03] <_joe_> oh gosh [12:44:15] I was able to get its content [12:44:57] <_joe_> yeah me too, from codfw [12:45:01] Most of the URLs in this list are valid https://phabricator.wikimedia.org/P9103 [12:45:25] <_joe_> and I guess you're served by codfw [12:45:29] The exceptions are the ones that don't provide a source like `/18/149486/83304.png ` [12:45:50] <_joe_> ok, can you see why that could cause what we are observing? namely very high cpu usage? [12:45:57] I'm also running a a local environment to test it [12:47:12] <_joe_> first thing I'm sure of: urls with a $ all go to timeout [12:47:44] Yes I'm trying to find the timing out issue, but might be a service-runner thing when there are too many worker dying it hang until some worker availability, but that's just guess [12:48:01] <_joe_> that's one of the things yes [12:48:07] <_joe_> workers are dying in flocks [12:48:11] <_joe_> being killed, rather [12:48:26] negative coordinates also seem to be timing out [12:48:46] <_joe_> not all? unless I did some mistake earlier [12:49:17] not all what? [12:49:25] AFAIK, tile coordinates must be a positive integer [12:50:08] They are different from other coordinates like EPSG:3857/4326 which use mercator coordinates and can have negative values [12:50:25] <_joe_> gehel: https://dpaste.de/5g4y [12:50:50] <_joe_> mateusbs17: see the paste above, it seems to work [12:51:11] I was testing on /osm-intl [12:51:17] You are not wronf _joe_, when you request an img (snapshot service) you must provide EPSG:3857 coordinates [12:51:17] <_joe_> ok [12:51:40] <_joe_> so it's negative and just /osm-intl [12:51:51] err sorry I guess over here [12:52:05] tile coordinates should be positive integers [12:52:08] I'd just limit your nginx fix to ^/osm-intl for now instead of /img/... [12:52:35] longitudes / latitudes as used in /img seem to allow negative (but why ?) [12:52:52] because they're raw coordinates, which can be negative [12:52:57] (as in geoid coordinates) [12:53:18] Oh yes, east is negative [12:53:36] or west, [12:55:46] <_joe_> can I have an example of invalid url? [12:56:00] curl -I -k -H 'Host: maps.wikimedia.org' 'https://maps2002/osm-intl/11/-1029/698.png' [13:03:13] the regex for the /img/osm-intl endpoint seems to be [13:03:16] /img/osm-intl(,(\d+|a)){3},\d+x\d+.(png|json|headers|svg|jpeg)(\?|$) [13:04:20] bblack: I left one comment in the patch, considering the discussion here are we able to limit negative integers in /osm-intl [13:06:47] sorry correction [13:06:48] \/img\/osm-intl(,([\d\.-]+|a)){3},\d+x\d+.(png|json|headers|svg|jpeg)(\?|$) [13:07:15] as far as i can tell that matxches all the http 200 we see on the /img/osm-intl uri [13:08:25] mateusbs17: the varnish one is anchored at ^/osm-intl and doesn't deal with the /img/ version [13:08:40] and the regex doesn't allow negative (\d is just 0-9) [13:09:27] could add one for /img/ separately though, use jbond42 suggestions [13:09:31] Given the results on maps1003, bblack can we merge the varnish one ? [13:09:49] bblack: was just drafting something [13:10:24] bblack: LGTM [13:11:09] ok will push what I have, we can go add a /img/ one after if we want [13:11:20] jbond42: you are missing the scale property `(@\d+x)` before `.(png...` [13:12:08] The README of kartotherian has a full description of the URLs [13:12:08] https://github.com/wikimedia/mediawiki-services-kartotherian/tree/master/packages/kartotherian [13:15:55] mateusbs17: do you know why long and lat seem to accept 'a' [13:16:14] ok the varnish sanity-filter on ^/osm-intl/ is live on all upload varnishes now [13:16:36] it's a for auto, the maps centers automatically when it has features in the map [13:17:32] ahh ok thanks, the /tmp/osm-intl is avalible for review [13:17:37] jbond42 if you have a line or a polygon to be rendered, it can ignore provided lat lon and use the geometries coordinates to position the map [13:18:14] <_joe_> bblack: did you push the sanitization right? [13:18:25] <_joe_> because things don't seem to go that much better [13:18:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/536588 [13:19:20] _joe_: 13:16 < bblack> ok the varnish sanity-filter on ^/osm-intl/ is live on all upload... [13:26:11] _joe_: yes it's pushed [13:26:32] <_joe_> load is at 20 on maps2003 [13:26:34] maybe it's the slow /img/ ones? [13:26:35] <_joe_> vs 44 before [13:26:45] <_joe_> so it's in better shape now [13:26:54] <_joe_> still bottoming out the cpu at times though [13:27:25] _joe_: s/2003/1003/ ? [13:27:49] bblack: can you check https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/536588 for /img [13:28:34] mateusbs17: can you also re check the regex and re add +1 [13:28:44] <_joe_> yeah sorry gehel [13:29:06] tilerator has been restarted by puppet [13:29:18] <_joe_> yup [13:29:27] <_joe_> I think it's clear it wasn't the issue right [13:29:44] agreed, just wanted to make it clear [13:33:58] so are we clear-ish now, or is there more invalid stuff or things we need to ratelimit harder? [13:34:51] CPU is still high [13:35:14] I think we're missing a general rule to drop any URL with "$" [13:35:29] this still times out: curl -I -k -H 'Host: maps.wikimedia.org' 'https://maps2004/$' [13:35:59] heh [13:36:08] what awful input validation :P [13:36:20] I can just ban the dollar-sign for all maps URIs, assuming there's no legit use [13:36:50] bblack: I think that would be best [13:36:57] https://maps1003.eqiad.wmnet/1 also hangs as well [13:37:01] lol [13:37:04] mateusbs17: is there any reason to allow a "$" anywhere? [13:37:12] the /1 case is interesting [13:37:38] gehel: none I still don't know why it is timing out [13:37:44] maybe we need a whitelist of everything we should allow. it seems like we've already built the two most-difficult regexes [13:38:06] there was a "location ~* /\d\d/" on maps1003 which we did not replicate yet [13:38:58] what's that? [13:40:20] that was extracted from a list of failing urls by joe [13:43:44] Is that any reason to not evaluate the pokemongo heavy access to our infrastructure? [13:43:44] https://turnilo.wikimedia.org/#webrequest_sampled_128/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwgBGcATgLQDGAFqWiADQgYC2Apsjq/iAKJoUB6AKoAVAMIMQAMwgI0rElHwBtUGgCeABy4EWXRiVZTuAfRN7JegAoKsAEyV5VIOzBLosuAlYCMAEUkoeU18Mh9GDW1uBHRWIJAAXwBdBPo1LR0QIJIIbABzSUNjAjcIEypMeMY4CgwcblzJMEQYOJUQZjhNKAA6AHcIAGsINjsIOB7MEgKkxmxMOjwpRChWZMYoTSQ0R2dIzOzcgsYxtmwoT24ihQVAqcXQIu4qCB3LDO5VnLaTiENay4EOxxCisbBjfKSbQ5TB2biJRhIZivfAAVgADOssrY2k5 [13:43:44] HkYbthQc9XkpGFIpp0HkwPkCjHAYHImi1MgiaVECGw4LBDIlZiBNLlsKw7H4RmCLjhdvzBdhhXYAMr3ElvCLihUvKSLEC+ADqkgQrDyYLheGwjIQjBeeSoSBti3NCAQCSAA [13:51:48] mateusbs17: that sound like a product owner question, but it's a good one [13:52:04] more generally, should we allow external entities to use our maps [13:53:28] (which is a decision that has been visited before in the distant past, but could/should be revisited) [13:54:41] i have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/536595 as a starting point for a white list [13:55:07] historical breadcrumbs: [13:55:20] T154704 [13:55:21] T154704: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704 [13:55:26] T169175 [13:55:27] T169175: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175 [13:58:08] mateusbs17: does this look like a legitimate request: https://maps2004/18/130337/92348.png [14:00:33] it's missing the source parameter, should be https://maps2004/osm-intl/18/130337/92348.png [14:00:39] ok [14:00:55] so PS4 of jbond's new whitelist thing might work? [14:01:20] err sorry PS5 [14:01:45] just updating with more comments from mateusbs17 [14:04:57] sign language course, back later, sorry [14:05:21] ok [14:05:30] I'm gonna refactor that regex a little, sec [14:06:15] ^\/(robots\.txt|main\.(js|css)|favicon.ico|lib|leaflet|geoshape|geoline|babel|((img\/)?osm-(intl|pbf)|v4|osm)) [14:06:24] bblack: thats the regex i was jjust about to commit [14:09:12] sorry is it geoline or geolite? [14:09:18] geoline [14:09:35] no need for \/ [14:09:40] in the patch version anyways [14:10:13] did "marker" go away too? [14:10:18] (was alongside leaflet, etc) [14:10:59] it was v4/marker before i droped the marker as im not sure if there could be more v4 endpoints [14:11:08] got it [14:11:42] re-picking, I think there's a parens error but have to see it in vi heh [14:12:08] feel free to just push any corrections [14:13:37] yeah it's just easier than trying to speak english about regexen heh [14:13:53] exactly [14:13:55] ps7 I think I fixed up the last bit (also added back babel there) [14:14:44] althrough really the whole (-intl|-pbf)? block is pointless with no trailing anchor, could just be removed [14:15:22] true [14:15:38] did that [14:15:41] ps8 now :) [14:16:52] ack cheeers looks good to me, gehel mateusbs17 ? [14:17:58] bblack jbond42 LGTM [14:19:37] ok, giving it a spin! [14:24:00] ok the prefix whitelisting filter thingy is fully deployed now [14:26:55] ack, thx [14:27:21] I'm guessing the load may take a while to react while existing in-progress hangs timeout, etc [14:27:53] yes maps1003 is going down but wil give it 30 mins before calling it a success [14:30:28] Is it possible to see the now filtered bad requests? So I can move the input validation to kartotherian? [14:32:24] the filtered bad reqs will still show up in e.g. webrequest sampled analytics [14:32:52] but the input validation should probably not depend on what external traffic looks like, but rather the hard canonical truth of what the legit URLs actually are [14:36:33] After reviewing the information you provided, we believe that you may be affected by a known issue. Here are some details about the issue: [14:36:33] Description: [14:36:33] For some users, Calendar fail to load events with "Calendar could not load the data. Please try reloading later" [14:36:33] Workaround: [14:36:33] There's no workaround available at this time, but we'll let you know if we learn of one. [14:37:15] Good thing we're paying for support... [14:37:25] https://github.com/kartotherian/kartotherian/blob/e13b4535e1f199f41566d7ad019949eebe1f4f2a/packages/input-validator/validator.js [14:37:37] ^ apparently current master at least does have something that supposedly validates input [14:37:41] as of... 6 months ago? [14:37:58] although putting validation in a separate sub-package seems kind of like a bad smell for the rest of the code :) [14:38:40] and I'm not sure the degree of normalization there, it may just be checking for the most-egregious things [14:44:21] someone got the idea to split kartotherian into dozens of modules before our team inherited it [14:44:32] some do more or less stand on their own, others... not so much [14:45:24] putting it all back together again has never quite made it to the top of the todo list [14:45:54] Yeah, some refactors are happening in the stack but really slowly because of resources [14:47:53] I will wait until webrequest sampled analytics update the fixed issue and will investigate it better to improve input-validation in kartotherian [14:48:21] Thanks everyone! Please ping me if needed. [16:33:19] bblack and jbond42 I have one fix for the regex filter [16:33:20] https://gerrit.wikimedia.org/r/c/operations/puppet/+/536621 [16:33:39] when you have time, please let me know if you have any questions [16:54:17] mateusbs17: looks okto me a pretty simple but i would still prefer brandon push it as im still farily green with varnish [16:54:44] thanks! [17:50:54] yeah, pushing shortly [18:00:00] mateusbs17: deployed [19:10:44] bblack: thanks, it works fine [19:28:21] gehel: mateusbs17: any reason not to re-enable puppet on the maps hosts (back to normal state there)? [19:44:56] I don't think it's a problem to re-enable. But I wasn't around when it was disabled, I'll wait for Guillaume's reply. [20:54:00] bblack: no reason that I know of [20:59:27] I'll be back to a computer in ~2h I'll check the status then