[00:32:07] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching): Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3152755 (10Smalyshev) > Perhaps https://commons.wikimedia.org/data-mediainfo/File:Foo.jpg So what would be the content of ht...
[06:09:15] 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153014 (10Steinsplitter) @ema will this be fixed soon? If not i have to fix stuff & update the MediaWiki message o...
[07:07:55] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3153092 (10Nemo_bis)
[07:08:29] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Nemo_bis) (Fixed summary to reflect the "original" bug repor...
[07:27:53] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153103 (10Nemo_bis) >>! In T76560#807460, @Nirzar wrote: > We were trying to populate this spread sheet with common errors > https://docs.google.com/a/wikimedia.org/spreadshee...
[07:36:19] mutante: thanks for checking! Indeed it looks like lvs2002 went down for a bit https://grafana.wikimedia.org/dashboard/db/load-balancers?panelId=7&fullscreen&orgId=1&from=1491253773814&to=1491261627244
[07:44:08] is there a similar thing like "racadm getsel" for HP iLO?
[07:44:20] it would be useful to see console events..
[07:44:57] elukey: not sure but there's a flood of usb-related messages in dmesg
[07:44:58] [Tue Apr 4 05:17:13 2017] usb 3-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[07:45:00] I found show /system1/log1 but it doesn't seem really helpful (or maybe I need coffee)
[07:57:34] 10Traffic, 06Operations: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153143 (10ema)
[07:57:46] 10Traffic, 06Operations: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153132 (10ema) p:05Triage>03Normal
[08:17:03] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153160 (10Nemo_bis)
[08:19:36] 10Traffic, 10Varnish, 06Operations, 06WMF-Design, and 2 others: Better WMF error pages - https://phabricator.wikimedia.org/T76560#3153164 (10Nemo_bis)
[08:24:11] 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153178 (10ema) >>! In T161517#3153014, @Steinsplitter wrote: > @ema will this be fixed soon? If not i have to fix...
[08:27:09] 10Traffic, 06Operations: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153180 (10ema) Oh and apparently the repeated USB messages have been reported already in T148017.
[08:34:24] 10Traffic, 06Commons, 06Operations, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3153187 (10Nemo_bis) This is the only pending question, isn't it? >>! In T161517#3140732, @Krinkle wrote: > If we...
[10:14:26] 10Traffic, 06Operations: lvs2002 hanging, usb messages flooding kernel logs - https://phabricator.wikimedia.org/T162117#3153340 (10ema)
[10:14:48] 10Traffic, 06Operations, 10ops-codfw: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10ema) p:05Triage>03Normal
[10:50:12] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10jcrespo)
[10:55:37] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153405 (10jcrespo) Origin ips (under NDA): {P5199} The queries done are: ``` ?format=json&action=parse&page=[*title*]&prop=tex...
[11:01:15] 10Traffic, 06Operations, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10TheDJ) I don't think that the purge was complete. This one h...
[11:26:43] 10Wikimedia-Apache-configuration, 10ArchCom-RfC, 10Wikidata, 06Services (watching): Canonical data URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3153431 (10daniel) >>! In T161527#3152755, @Smalyshev wrote: >> Perhaps https://commons.wikimedia.org/data-mediainfo/File:Foo...
[11:50:37] 10Traffic, 06Operations, 10ops-esams: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153465 (10ema)
[11:50:45] 10Traffic, 06Operations, 10ops-esams: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153480 (10ema) p:05Triage>03Normal
[12:16:57] 10Traffic, 06Operations, 10ops-esams: cp3003 network interface issues - https://phabricator.wikimedia.org/T162132#3153567 (10ema) I've tried a "cold reboot" with `racadm serveraction powerdown ; racadm serveraction powerup` to no avail.
[12:44:23] bblack: hey :)
[12:44:34] so it looks like we're still serving some images with wrong CT https://phabricator.wikimedia.org/T162035#3153406
[12:45:32] perhaps bad timing yesterday with ban/workaround deployment?
[12:46:08] I'd run the ban again to be sure
[12:46:51] it's worth a shot
[12:47:27] salt -b 1 -v -t 30 -C 'G@cluster:cache_upload and G@site:eqiad' cmd.run "varnishadm ban 'obj.http.content-type == \"text/html; charset=UTF-8\"'"
[12:48:05] and then same story with G@site:codfw
[12:48:36] followed by not G@site:codfw and not G@site:eqiad and finally the frontends
[12:50:33] bblack: does that seem reasonable?
[12:52:15] ema: yeah except maybe the content-type match could be broader just in case
[12:52:38] maybe: 'obj.http.content-type ~ \"^text\"'
[12:53:55] yeah I thought of that too but I wasn't sure if a regex ban could be worse in terms of potential post-ban issues with mailbox lag
[12:54:33] eh probably not
[12:54:53] str regex vs equality is a very small cpu perf hit, most of the problem is iterating storage structures themselves
[12:57:32] ok then
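Spelled out with the broader match just agreed on, the full sequence might look something like this. Only the eqiad backend command above is quoted from the log; the codfw / remaining-DC variants and the frontend invocation with `varnishadm -n frontend` are an illustrative expansion of the plan, not the exact commands that were run:

```
# Backends, one DC at a time, one host at a time (-b 1).
salt -b 1 -v -t 30 -C 'G@cluster:cache_upload and G@site:eqiad' \
    cmd.run "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
salt -b 1 -v -t 30 -C 'G@cluster:cache_upload and G@site:codfw' \
    cmd.run "varnishadm ban 'obj.http.content-type ~ \"^text\"'"
salt -b 1 -v -t 30 -C 'G@cluster:cache_upload and not G@site:codfw and not G@site:eqiad' \
    cmd.run "varnishadm ban 'obj.http.content-type ~ \"^text\"'"

# Finally the frontends (assuming the frontend instance is addressed
# with -n frontend, as in the varnishncsa example further down).
salt -b 1 -v -t 30 -C 'G@cluster:cache_upload' \
    cmd.run "varnishadm -n frontend ban 'obj.http.content-type ~ \"^text\"'"
```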
[13:28:18] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153673 (10jcrespo) Seems to have stopped for now since 12:34 UTC: https://grafana.wikimedia.org/dashboard/db/api-summary?panelId...
[13:31:18] so this time I've waited for the ban lurker to be done at each banning step (trying to be nice!)
[13:31:45] now the frontend bans are being processed and we had a brief 503 spike in ulsfo exactly when I started the frontend bans (coincidence?)
[13:40:16] probably not coincidence
[13:40:25] interesting that it's the frontend ban start though
[13:40:40] was that maybe correlated with other stats spiking on the backends it accesses?
[13:40:58] the FE mem cache should be the "easy" case for the ban (smaller and faster)
[13:41:29] but maybe it's that shortly after the ulsfo backends' ban lurker finishes, some storage book-keeping kicks in and causes reqs from ulsfo FEs to fail a bit
[13:42:55] ouch problem not solved yet :(
[13:43:10] I'm staring at this on esams frontends:
[13:43:10] varnishncsa -n frontend -q 'RespHeader ~ "Content-Type: text/html" and RespStatus eq 200' -F '%{Content-Type}o %r'
[13:43:36] there are a few requests matching, eg:
[13:43:39] https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/TabulaRogeriana.jpg/1920px-TabulaRogeriana.jpg
[13:43:46] so is it possible the fix doesn't fix anything?
[13:44:15] I mean, I don't think we have experience hacking on 304 responses - maybe they don't use the same control flow, or are already converted into virtual-200s before they hit backend_response, or some such issue
[13:44:37] yep, it does certainly look like not all CT are appropriate
[13:45:08] maybe start back at the varnish control flow diagrams and map out what should be happening on 304-refresh of a keep "hit"
[13:45:28] (maybe needs a hook in a different spot in the diagram to catch + mod one with real effect)
[13:45:47] this seems like a good excuse to write a vtc test :)
[13:46:11] of course! :)
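A rough sketch of the kind of VTC test being proposed here (the paste P5202 linked a bit later is presumably the real one; the backend behaviour, URL, headers and timings below are illustrative, not copied from it). On an unpatched configuration the final expect is the one that fails, because the 304 refresh overwrites the cached object's Content-Type:

```
varnishtest "304 refresh should not clobber the cached Content-Type"

server s1 {
    # First fetch: the object as swift should serve it.
    rxreq
    txresp -hdr "Content-Type: image/png" \
           -hdr "Last-Modified: Wed, 05 Apr 2017 00:00:00 GMT" \
           -hdr "Cache-Control: max-age=1" \
           -body "not-really-a-png"
    # Background revalidation: a 304 that (wrongly) carries text/html,
    # as described in T162035.
    rxreq
    expect req.http.if-modified-since == "Wed, 05 Apr 2017 00:00:00 GMT"
    txresp -status 304 -hdr "Content-Type: text/html; charset=UTF-8" \
           -hdr "Cache-Control: max-age=60"
} -start

varnish v1 -vcl+backend {
    sub vcl_backend_response {
        set beresp.grace = 10s;
        set beresp.keep = 1d;
    }
} -start

# Cache miss, correct Content-Type.
client c1 {
    txreq -url "/thumb.png"
    rxresp
    expect resp.http.Content-Type == "image/png"
} -run

delay 2

# Grace hit: still served with the right CT while the IMS fetch runs.
client c1 {
    txreq -url "/thumb.png"
    rxresp
    expect resp.http.Content-Type == "image/png"
} -run

delay 1

# After the 304 has been merged in, the stored CT is text/html:
# this is the assertion that fails without a fix.
client c1 {
    txreq -url "/thumb.png"
    rxresp
    expect resp.http.Content-Type == "image/png"
} -run
```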
[13:48:21] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153759 (10jcrespo) 05Open>03stalled
[13:58:32] ema: was noticing caches are behind on some minor package updates anyways (e.g. libs, some linux-base, linux-meta stuff, etc)
[13:58:49] ema: maybe should apt-get -y upgrade them all before kernel reboots too
[13:59:13] bblack: yeah I was discussing this with moritzm, he's taking care of the wireshark updates
[13:59:54] the rest, we'll want to sync them all up to date with upstream eventually anyways, may as well get a fresh boot on the new packages
[14:00:07] he's testing https://gerrit.wikimedia.org/r/#/c/346162/
[14:00:14] ok
[14:00:25] tests have been successful, now only needs review :-)
[14:00:39] on cp* only the kernel and wireshark are pending
[14:00:53] and at some point we should spread the general-updates + 4.9 stuff to lvs+authdns too
[14:00:58] (and for some reason that stupid debconf prompt now triggers for wireshark)
[14:38:07] 10netops, 06Operations, 10ops-esams: esams higher than usual temperature - https://phabricator.wikimedia.org/T162152#3153928 (10faidon)
[14:40:14] ha!
[14:40:25] https://phabricator.wikimedia.org/P5202
[14:40:46] only the last assertion fails
[14:41:29] that's because the second request triggers a IMS request in the background and we still get the right CT as a response from varnish
[14:42:09] but then indeed as soon as the 304 comes in varnish updates CT setting it to text/html, with and without our workaround
[14:43:48] maybe because the 304 does not go through _backend_response?
[14:50:33] oh noes!
[14:50:45] status gets set to 200 *before* calling vcl_backend_response
[14:50:58] https://varnish-cache.org/docs/4.1/users-guide/vcl-built-in-subs.html#vcl-backend-response
[14:51:16] so we need that stupid was_304 or whatever it's called
[14:52:44] yeah that was it, beresp.was_304
[15:02:39] bblack: https://gerrit.wikimedia.org/r/346304
[15:04:45] hhm no hold on a sec
[15:05:25] resp.http.Content-Type is actually in the last assertion ?!?
[15:05:44] madness
[15:08:41] don't tell me that we need to save Content-Type in X-Orig-Content-Type and then restore it
[15:10:30] <_joe_> ahahahahahah
[15:10:46] <_joe_> so we're back to fixing swift?
[15:12:31] well that doesn't seem like a bad idea regardless :)
[15:12:54] * volans hides
[15:13:09] how hard can it be (TM)
[15:14:10] ema: to remove the header should be rather easy, to use the right one not that much, the old version structure doens't have an easy way to pass parameter to the 304 IIRC the code I've seen yesterday
[15:14:37] volans: goal is to remove CT on 304 responses
[15:15:26] and we'll be patching only half of the fleet
[15:15:43] > Any headers not present in the 304 response are copied from the existing cache object
[15:15:45] ema: yes but the newer version return the proper one
[15:15:59] so we'll still have a difference in behaviour
[15:16:06] old: no header, new: correct header
[15:16:10] for CT
[15:16:11] volans: whatever? :)
[15:16:42] of that doesn't generate issues with the cache and/or the VCL
[15:16:47] s/of/if/
[15:17:13] well the gospel says it shouldn't
[15:17:18] let's see if it's true
[15:18:37] yeah, it's true
[15:19:44] no CT == correct CT on 304?
[15:20:24] no CT on 304 == varnish leaves the original CT returned with the 200 response alone
[15:21:22] luckily, otherwise I'd consider a career in orange picking
[15:21:37] lol
[15:22:14] so it's the same, from the cache point of view, of a correct (as in same of 200) CT returned from the backend on a 304
[15:22:42] yes
[15:24:02] then at least we know it's a possibility to go the fix swift way :D
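Given that behaviour (a 304 carrying no Content-Type leaves the cached object's header alone, while one carrying a header overwrites it), the stash-and-restore idea floated at 15:08 might look roughly like the fragment below for the upload backend VCL. This is only a sketch of the shape of it, not the actual change that ends up under review in Gerrit 346304:

```
sub vcl_backend_response {
    if (!beresp.was_304) {
        // Stash the Content-Type that came with the real (200) response;
        // it is stored with the object, and headers absent from a later
        // 304 are copied back from the cached object, so it survives.
        set beresp.http.X-Orig-Content-Type = beresp.http.Content-Type;
    } else if (beresp.http.X-Orig-Content-Type) {
        // On a 304 refresh swift sends "text/html; charset=UTF-8", which
        // overrides the stored header; put the original back. Objects
        // cached before this change have nothing stashed, so they are
        // left as-is and have to age out of the ttl/keep window.
        set beresp.http.Content-Type = beresp.http.X-Orig-Content-Type;
    }
}

sub vcl_deliver {
    // Keep the helper header internal.
    unset resp.http.X-Orig-Content-Type;
}
```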
[16:06:28] alternatively, if fixing swift does seem like taking a bit too long, we could do something along these lines:
[16:06:31] if (beresp.was_304 && beresp.http.Content-Type ~ "^text/") {
[16:06:33] if (bereq.url ~ ".png$") {
[16:06:36] set beresp.http.Content-Type = "image/png";
[16:06:38] } else if (bereq.url ~ ".jpg$") {
[16:06:41] set beresp.http.Content-Type = "image/jpeg";
[16:06:43] }
[16:06:46] }
[16:07:00] which is ugly but arguably better than messing with copying Content-Type around in support headers?
[16:07:10] bblack: thoughts? :)
[16:10:31] yeah well with (?i) most likely
[16:15:15] quite ugly indeed!
[16:15:36] yeah bleah not to mention gif, svg, ....
[16:16:35] can at least re-use the matching of the regex in teh set?
[16:16:53] jpg vs jpeg and meh
[16:17:00] so you could match a list of extensions (still riscky to not catch them all though)
[16:17:26] s/riscky/risky/
[16:17:28] no really at this point X-Orig-Content-Type does look better (go figure!)
[16:26:43] yeah this looks much better
[16:26:49] https://gerrit.wikimedia.org/r/#/c/346304/2/modules/varnish/templates/upload-backend.inc.vcl.erb
[16:36:26] and it even almost works! :)
[16:42:08] bblack: https://gerrit.wikimedia.org/r/#/c/346304/
[17:47:11] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Legoktm) > Requests do not have a user agent There's no user-agent header at all or is it some generic UA?
[17:57:10] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations, 05Security: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154452 (10MaxSem)
[17:58:15] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10MaxSem)
[18:08:31] ema: LGTM overall, nice work, I left one paranoia-nit comment in there
[18:10:48] and of course there's a caveat that with the X-Orig based solution, it will take a few days for the fix to work 100% of the time, as the cache entries refill with the 3d TTL
[18:11:04] and the 3d keep could keep the X-Orig-lacking entries in there quite a while longer than that...
[18:11:37] so we could merge up that solution, then maybe re-do the CT ban 1-2 times a day until we get to that 3d mark.
[18:11:53] after that, we might look at some way to purge everything older that's been keep-refreshing somehow
[18:12:26] ugh, sorry that you have to deal with this guys :/
[18:12:44] hopefully it's just until next week when filippo can handle the swift upgrades at peace
[18:12:55] do you have any sense of why this started being a problem now?
[18:13:39] paravoid: I think we've had this problem for a long time, it's just been low-rate and unlikely to get reported significantly. We did have a phab ticket and report a few months back.
[18:13:59] I think the shift of ttl/keep values likely exacerbated it
[18:14:18] (we could revert those and maybe it will reduce the incidence rate too, but it's only a theory anyways)
[18:14:26] yeah
[18:17:25] I think we've actually aggravated this twice, first was about a month ago, when we first realized that we weren't setting a useful "keep" value to make better use of conditional fetches, and we initially set it to 1d keep (after the 7d ttl)
[18:17:43] then a week ago or so we switch to 3d TTL + 3d keep, so now those conditional fetches are even more likely
[18:18:15] (but I think even before that first change to enable keep, varnish was doing conditional fetches, just only in grace-window refreshes when an object gets hit near its end-of-life)
[18:22:47] ema: maybe for now, just for cache_upload, we should set keep lower until swift gets fixed (in addition to the fixup above). At least back to 1d like it was before the ramp up in complaints, maybe even back to zero and just rely on grace temporarily (it worked ok for a long time, just not ideal)
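If that temporary keep reduction were done directly in VCL, it might amount to little more than the fragment below for the upload backends (illustrative only; in practice the value presumably lives wherever the current 3d ttl/keep is configured, and whether to go to 1d or all the way to zero is left open above):

```
sub vcl_backend_response {
    // Dial keep back from 3d to 1d (or "set beresp.keep = 0s;" to rely
    // on grace alone) until swift stops sending text/html on 304s.
    set beresp.keep = 1d;
}
```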
[18:52:22] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154681 (10jcrespo) User agent was "-" (without quotes).
[18:57:28] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10MaxSem) We used to block API requests that provided no UA - anybody remembers why did we stop doing that?
[19:29:15] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Tgr) >>! In T162129#3154681, @jcrespo wrote: > User agent was "-" (without quotes). More likely, nothing at all. The...
[19:33:54] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154911 (10Tgr) Did the IPs change periodically or did they actually use 50 boxes to query the API in parallel? The second case s...
[19:54:07] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3154951 (10Tgr) Seems to have restarted (at least based on raw GET volume, haven't looked at what type it is). See P5199#27747 f...
[20:05:55] 10netops, 06Operations: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055#3154987 (10faidon) 05Open>03Resolved a:03faidon I just deployed a change which puts 224/4 back to special-ranges4 and nothing seems to be broken.
[20:16:12] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3153389 (10Anomie) >>! In T162129#3154715, @MaxSem wrote: > We used to block API requests that provided no UA - anybody remembers...
[20:17:50] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155015 (10jcrespo) He is back, and now trying to parse Special pages, too :-) > Did the IPs change periodically or did they act...
[20:39:44] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155110 (10Anomie) The simple solution may be to just block the IPs in varnish or the like, perhaps delivering a message like "If...
[20:56:42] 10Traffic, 10DBA, 10MediaWiki-API, 06Operations: Someone is parsing all enwiki pages using the action api at a rate of ~2M pages/hour - https://phabricator.wikimedia.org/T162129#3155143 (10Tgr) > I don't think it is malign, just parallelizing queries to load balancing source IPs (always the same ones). Ye...
[21:52:53] 10netops, 06Operations, 10hardware-requests, 10ops-codfw, 13Patch-For-Review: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3155377 (10Dzahn)
[21:54:35] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3155380 (10BBlack)
[22:32:13] 10Traffic, 10Librarization, 10MediaWiki-extensions-CentralNotice, 06Operations, 07Privacy: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#3155492 (10Reedy)
[22:36:13] 10netops, 06Operations, 10ops-eqiad: Faulty optics on asw-b-eqiad:xe-1/1/2 - https://phabricator.wikimedia.org/T162199#3155501 (10Reedy)
[22:48:59] 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3155545 (10Dzahn) I signed Arzhel's GPG key after he read the fingerprint to me over Hangout. gpg --fingerprint 58E24182 Key fingerprint = 8F89 0CBB E7BE...
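Circling back to the T162129 API-scraper thread: if the "block the IPs in varnish" idea from 20:39 were taken up, the shape of it might be something like the sketch below. The ACL name, the documentation-range address and the synth message are placeholders (the real source IPs are only in the NDA paste P5199), and the missing/placeholder User-Agent check mirrors the "-" / empty UA observations earlier in the task:

```
acl enwiki_api_scraper {
    # placeholder; the ~50 real source IPs are in P5199 (under NDA)
    "192.0.2.0"/24;
}

sub vcl_recv {
    if (client.ip ~ enwiki_api_scraper && req.url ~ "^/w/api\.php"
        && (!req.http.User-Agent || req.http.User-Agent == "-")) {
        return (synth(403, "Please send a descriptive User-Agent header"));
    }
}
```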