[02:51:54] Traffic, MediaWiki-extensions-CentralNotice, Operations: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2711739 (BBlack) If we want to go down that kind of road, it would probably be better efficiency-wise to have varnish set simpler request-side head...
[07:22:29] good morning :)
[07:23:13] so our data consistency checks alerts triggered this night (EU time), I collected data in https://etherpad.wikimedia.org/p/analytics-oozie-13102016
[07:23:20] only for upload
[07:24:54] seems from 2016-10-12-17 to 2016-10-13-0
[07:24:59] err 2016-10-13-05
[07:25:49] in the etherpad I put the hosts that registered the missing data, we have it on hadoop
[07:26:25] and from https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes I see some 503s but not a lot to justify this mess
[07:26:46] I think that there might still be some requests ending up in VSL timeouts
[07:27:06] so if you don't mind I'd like to start tmux sessions on a lot of upload hosts with varnishlog
[07:27:16] to catch VSL timeouts when they re-occur
[07:31:12] (I saw the 5xx alerts but afaics they don't match with the start of the issue, correct me if I am wrong)
[08:05:18] elukey: yeah the consistency issues don't really seem to correlate with 503s
[08:05:44] feel free to run varnishlog of course
[08:46:40] ema: varnishlog -n frontend -L 5000 -T 1500 -q "VSL ~ timeout or VSL ~ overflow" - does it make sense?
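(Editor's sketch, not part of the original conversation.) The question above, and the single-quoted variant suggested right after it, come down to what string varnishlog receives as its -q argument. A minimal way to check that, assuming the commands are typed into a POSIX-style shell, is Python's shlex module, which follows POSIX word-splitting rules. The variable names are illustrative only.

```python
import shlex

# The two command lines from the log, as they would be typed in a shell.
first = 'varnishlog -n frontend -L 5000 -T 1500 -q "VSL ~ timeout or VSL ~ overflow"'
second = "varnishlog -n frontend -L 5000 -T 1500 -q 'VSL ~ \"timeout\" or VSL ~ \"overflow\"'"

# shlex.split approximates POSIX shell word splitting and quote removal.
first_query = shlex.split(first)[-1]
second_query = shlex.split(second)[-1]

print(first_query)   # query text with bare words
print(second_query)  # query text with the inner double quotes preserved
```

In both cases the query arrives as a single argument; the only difference is whether the inner double quotes reach the VSL query parser. How the parser treats bare versus quoted operands is not shown here and is settled empirically in the log itself.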
[08:47:25] or maybe varnishlog -n frontend -L 5000 -T 1500 -q 'VSL ~ "timeout" or VSL ~ "overflow"'
[08:47:38] the latter looks better
[08:50:44] nah, the former is also good (confirmed with -q 'ReqURL ~ Banana or ReqURL ~ Potato')
[08:51:14] ahahhaah
[08:51:36] I was thinking something like VSL is not empty
[08:51:43] or you see a VSL tag
[08:52:32] because if there is a new VSL error that does not contain the timeout|overflow words I won't see anything :D
[08:57:50] varnishlog -n frontend -L 5000 -T 1500 -q 'VSL' might be enough
[09:06:30] all right started varnishlog on 9 hosts in tmux (ulsfo, esams and eqiad)
[11:05:23] I also tried to check timings for kafka1018 (disk broke, disk repaired, catch up with partitions, etc..) but not everything matches
[11:06:10] it could be another lead, namely not VSL timing out but kafka connections related (so librdkafka or simply vk not being resilient to kafka failures enough)
[11:13:19] Traffic, Analytics, Operations: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2712450 (ema) p:Triage>Normal
[11:14:06] !log mw1165 (MW Jobrunner) back in service after reimage
[11:14:06] Not expecting to hear !log here
[11:14:19] completely right
[11:14:29] the bot has a point :)
[11:14:41] :D
[11:55:09] https://pinkunicorn.wikimedia.org/ running v4 :)
[11:55:30] \o/
[12:04:31] \o/
[12:16:04] https://grafana.wikimedia.org/dashboard/db/varnishkafka - reworked a bit the vk's dashboard but don't see any clear indication that there were delivery errors of any sort (even if there are tons of metrics from librdkafka on graphite)
[12:17:21] keep in mind pinkunicorn access will be mixed-version
[12:17:40] it runs both frontend and backend, but it uses our standard puppetization/config, and it's not in the pool of backends for other frontends
[12:17:53] so in practice, traffic entering pinkunicorn's v4 frontend then chashes to the standard pool of v3 backends
[12:18:03] we could hack it to use itself only for a v4-only stack, though
[12:18:10] yep
[12:18:20] yeah, it would probably make sense generally as well
[12:18:38] in the past it was more for TLS testing than varnish so it kinda didn't matter
[12:18:42] but now, it probably does :)
[13:41:44] Traffic, netops, Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712727 (BBlack) Proposed subnet mapping: * eqiad ** high-traffic1 (lvs1001 + lvs1004) *** 208.80.154.224/28 (224-239) *** 2620:0:861:ed1a::0:0/111 (::0:0 - ::1:ffff) ** high-traffic2...
[13:54:16] Traffic, netops, Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712836 (faidon) I could find out, but since you've already done the investigation: do we need to renumber or relocate any IPs for this scheme to work? If so, which?
[13:58:04] Traffic, netops, Operations: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2712848 (BBlack) Audit result: All datacenters already obey the mapping above, except for 3x exceptions in eqiad: * ocg.svc.eqiad.wmnet - currently in high-traffic2, should be in low-...
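(Editor's sketch, not part of the original conversation.) The address ranges quoted in the proposed subnet mapping for eqiad high-traffic1 can be checked with Python's standard ipaddress module. Only the two prefixes visible in the truncated bot message are used; the variable names are illustrative.

```python
import ipaddress

# Prefixes quoted in the proposed mapping for eqiad high-traffic1.
v4 = ipaddress.ip_network("208.80.154.224/28")
v6 = ipaddress.ip_network("2620:0:861:ed1a::0:0/111")

# First and last address covered by each prefix.
v4_first, v4_last = v4.network_address, v4.broadcast_address
v6_first, v6_last = v6.network_address, v6.broadcast_address

print(v4_first, "-", v4_last, f"({v4.num_addresses} addresses)")
print(v6_first, "-", v6_last, f"({v6.num_addresses} addresses)")
```

The /28 covers the last octets 224 to 239, and the /111 leaves 17 host bits, which is why the IPv6 range ends at ::1:ffff, matching the ranges given in the task comment.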
[13:59:55] bblack: so if I understand this right, I can deploy the static nets statics immediately
[14:00:19] and I can add more-specifics for the three you mentioned
[14:00:27] or just ignore them for now
[14:01:46] yeah
[14:01:57] if you ignore them for now, they'll fail to work if the pybals die
[14:02:08] dns-rec-lb might be important enough to matter on that front
[14:02:50] I haven't yet thought through the process of moving the other two, whether it's possible without a short outage
[14:03:15] (very short, long enough to run puppet and restart pybal on two hosts)
[14:04:17] dns-rec-lb, it would probably be easier to renumber it (set up a new IP in the right subnet on the same LVS, then switch all the resolv.conf stuff, then pull the old one after verifying lack of traffic)
[14:06:55] dns-rec-lb all have IPv6 defined in DNS, too, but we don't template it out for actual resolver usage
[14:07:09] and their current IPv6 match up with being in high-traffic2 where they all live, just not the eqiad v4
[14:14:39] bblack: If I remember correctly from the offsite we decided not to install openssl(1) into the 1.1 binary package, /usr/bin/c_rehash clashes as well, but should be even more harmless to skip in favour of the binary from 1.0.2
[14:18:34] yes, not installing the binary package (or even uploading it to carbon)
[14:18:34] just the library package and the -dev
[14:20:37] ok
[15:44:28] Traffic, Analytics-Kanban, Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713169 (jcrespo)
[15:47:27] Traffic, Analytics-Kanban, Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713178 (Zppix) Doesn't appear to affect the iOS 8.1 app for Wikipedia.
[15:49:15] Traffic, Analytics-Kanban, Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713181 (jcrespo) These has been some of the updates we had recently: > We don't yet understand the full scope or specifics of either the > underlying issue GlobalSign is having, or any impa...
[15:49:22] Traffic, Analytics-Kanban, Operations: MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713182 (Zppix)
[15:49:51] I found the repro for the varnishkafka data inconsistencies, namely the fb crawler trying to get
[15:49:55] https://upload.wikimedia.org/wikipedia/commons/thumb/6/6e/Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg/720px-Miley_Cyrus_on_2015_Rock_and_Roll_Hall_of_Fame_Induction_Ceremony_%28cropped%29.jpg
[15:50:30] on cp3046:/home/elukey/timeouts.txt there is a sample of the VSL log
[15:50:51] that is full of only Link tags..
[15:52:09] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713203 (Zppix)
[15:55:29] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713217 (Zppix) Clearing CertUlti on edge doesn't fix the issue
[15:56:50] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713223 (ema)
[15:58:28] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713243 (Zppix)
[15:59:24] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (Zppix) Edge is completely blocking access to WMF sites as shown in screenshot number 2 in the task description
[15:59:48] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713248 (BBlack)
[16:02:12] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713281 (Zppix)
[16:02:24] Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (Zppix)
[16:04:14] Traffic, netops, DNS, Operations, ops-esams: eeden ethernet outage - https://phabricator.wikimedia.org/T146391#2713300 (BBlack) Resolved>Open Down again! Assuming for the moment it's ethernet again...
[16:06:05] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713305 (Zppix)
[16:10:50] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713344 (Zppix)
[16:12:26] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (Zppix)
[16:20:14] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713388 (Zppix) An user reports on IRC: earlier today (9-10 a.m. US eastern time) only *.wikimedia.org and wikimediafoundation.org sites were affected by the cert pro...
[16:36:43] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713447 (Zppix) A user in ENWIKI's help irc channel reports the error on Windows 10 Professional latest version, on Chrome - Version 54.0.2840.59 beta-m
[16:37:36] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713448 (Zppix)
[16:44:58] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713503 (Zppix)
[16:48:34] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713526 (Paladox) Chrome works for me still, seems to be spreading so may affect firefox soon.
[16:49:54] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713529 (Joe) @Paladox: firefox will keep working fine as it uses a different TLS stack from the one provided by the OS.
[16:52:30] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (Pietrodn) > GlobalSign suggested the following workaround, it's unclear whether it actually works or not: https://support.globalsign.com/customer/portal/article...
[16:53:16] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713539 (Zppix) @Pietrodn so firefox doesnt work on mac?
[16:54:20] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713543 (Mholloway)
[16:54:51] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713544 (Pietrodn) >>! In T148045#2713539, @Zppix wrote: > @Pietrodn so firefox doesnt work on mac? Wikipedia on Firefox works fine on macOS Sierra. Seems to be the onl...
[16:55:36] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713545 (Zppix) @Pietrodn ack
[17:09:31] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713587 (Paladox) More detailed GlobalSign explanation of the problem https://twitter.com/globalsign/status/786612660397715456
[17:10:19] HTTPS, Traffic, Analytics-Kanban, Operations: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713594 (Pietrodn) More detailed explanation of the technical problem by GlobalCert: https://downloads.globalsign.com/acton/fs/blocks/showLandingPage/a/2674/p/p-008f/t/p...
[17:13:34] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713607 (Legoktm)
[17:32:40] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713690 (Pietrodn) Working workaround for Chrome and Safari on macOS Sierra: http://apple.stackexchange.com/a/257112/33925 ``` $ sqlite3 ~/Library...
[17:37:19] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (BBlack) We've received an updated intermediate cert from GlobalSign that's compatible with our existing end-certs and supposedly fixes the...
[18:32:49] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713909 (BBlack) We're working through the other minor one-off cert issues now on smaller (mostly for technical folks sites), I'm breaking off a se...
[18:41:19] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2713969 (BBlack)
[18:59:02] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714049 (BBlack)
[19:06:02] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714092 (BBlack)
[19:11:02] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2713969 (MoritzMuehlenhoff) seaborgium and serpens use certs from our internal CA, not from GlobalSign.
[19:12:41] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714139 (BBlack) The ones in the puppet repo under files/ssl/ are signed by GlobalSign.... I wonder what's out of sync here?
[19:19:02] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714151 (MoritzMuehlenhoff) When we setup the openldap replacement servers for the OpenDJ setup, we started with an internal cert from the beginning. From what I...
[19:20:17] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714152 (BBlack)
[19:25:57] HTTPS, Traffic, Analytics-Kanban, Operations: GlobalSign intermediate updates for one-offs - https://phabricator.wikimedia.org/T148069#2714157 (BBlack) These are all fixed up now I believe, except for the 3x externally-hosted sites, which still link to the R1 root....
[19:27:34] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2713145 (hashar) OCG on ocg1001 ocg1002 ocg1003, started yielding CERT_UNTRUSTED error at 17:30 UTC One can monitor it via Grafana backend success...
[19:30:47] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714176 (BBlack)
[19:32:54] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714191 (BBlack) @akosiaris found https://github.com/nodejs/node/blob/db1087c9757c31a82c50a1eba368d8cba95b57d0/src/node_root_certs.h
[19:54:45] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714251 (Nuria) Will get numbers for Mac OS requests on Chrome and Safari per hour for the last 3 days to quantify impact, let me know if you no lo...
[20:27:44] bblack: I'm trying to find the ticket about being able to use cache misc with active/active services (in eqiad + codfw)
[20:30:04] I found it: T134404
[20:30:04] T134404: Varnish support for active:active backend services - https://phabricator.wikimedia.org/T134404
[21:26:26] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714531 (BBlack)
[21:26:30] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714528 (BBlack) Open>Resolved a:BBlack Resolved for now. To recap: Initial symptom was lots of errors the ocg logs after we deployed the...
[21:30:27] HTTPS, Traffic, Analytics-Kanban, Operations, Patch-For-Review: Windows 10 & MacOS Sierra Cert errors - https://phabricator.wikimedia.org/T148045#2714536 (BBlack) @Nuria - it would have to be specifically for MacOS Sierra (the new version that came out less than a month ago). There were othe...
[21:32:08] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714537 (hashar) That is a very nice fix and summary. Thank you!
[21:38:15] HTTPS, Traffic, Analytics-Kanban, Operations: OCG failing with new GlobalSign intermediate workaround - https://phabricator.wikimedia.org/T148076#2714544 (Volans) FYI it's worth noticing that the upgrade of NodeJS for this service looks a bit broken by design to me, given that `apt-get` will over...
[23:01:42] Traffic, Operations, Phabricator, Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2714910 (Dzahn) merged per prototype/"labs-only" no-op in prod http://puppet-compiler.wmflabs.org/4348/