[06:15:10] mutante: nice catch (re: pointless prometheus yaml file updates). I think https://gerrit.wikimedia.org/r/#/c/425218/ should do the trick
[06:16:13] Traffic, Operations, Pybal, Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#4119325 (Vgutierrez) Open→Resolved a: Vgutierrez
[07:03:01] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119368 (Marostegui) This is the list of slaves per section we'd need to depool before starting this maintenance: s1: db1089 main db1105 rc s2: db1060 vslow db1090 main db1105...
[07:37:26] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119423 (jcrespo) I would honestly move the x1 replica (or the master directly), probably in a logical way, somewhere else - we don't want to serve the whole service from the same row,...
[07:40:17] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119429 (Marostegui) >>! In T187962#4119423, @jcrespo wrote: > I would honestly move the x1 replica (or the master directly), probably in a logical way, somewhere else - we don't want t...
[07:43:08] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119434 (jcrespo) I would do the second.
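Marostegui's comment above lists the replicas to depool in a compact "section: host role host role ..." form. A small Python sketch of turning that notation into a mapping (the alternating host/role interpretation is my assumption, not something stated in the log):

```python
# Hypothetical helper: parse one compact depool line such as
# "s1: db1089 main db1105 rc" into (section, {host: role}).
# Assumes tokens strictly alternate host, role - an assumption on my part.

def parse_depool_line(line):
    """Return (section, {host: role}) for one 'sN: host role ...' line."""
    section, _, rest = line.partition(":")
    tokens = rest.split()
    hosts = dict(zip(tokens[0::2], tokens[1::2]))  # pair each host with its role
    return section.strip(), hosts

section, hosts = parse_depool_line("s1: db1089 main db1105 rc")
print(section, hosts)  # -> s1 {'db1089': 'main', 'db1105': 'rc'}
```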
[07:43:40] netops, Operations, ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119435 (jcrespo)
[13:02:46] bblack: please let us know what your thoughts are regarding https://gerrit.wikimedia.org/r/#/c/425040/
[13:23:27] vgutierrez: +1 :)
[13:24:51] bblack: as I commented with ema, I'll disable puppet on the primary LVSs, and I'll check how it behaves on the secondary ones
[13:25:04] just to be safe
[13:26:03] BTW, https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-http-requests?orgId=1 looks saner than the statsd equivalent: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1
[13:26:16] and you can see (on the prometheus one) how eqsin killed ulsfo :)
[13:26:17] vgutierrez, gehel: let's keep an eye on lvs1006 (now w/ UDP monitoring enabled) for the next few hours and then, if nothing burns, restart the other LVSs to pick up the config changes
[13:26:28] ema: ack :)
[13:27:51] ema: BTW, I'm moving forward with https://gerrit.wikimedia.org/r/#/c/421925/
[13:27:58] it already smells like a rotten CR
[13:28:01] "eqsin killed the ulsfo" must be the official traffic team karaoke song
[13:28:47] // !log beers on eqsin killing ulsfo O:)
[13:29:03] bblack: ok to get rid of varnishxcache? https://gerrit.wikimedia.org/r/#/c/421925/
[13:29:20] he already gave me a +1 on IRC a week ago :)
[13:29:34] ok then!
[13:30:02] yeah
[13:30:37] so I'm deleting the old dashboard before getting that merged
[13:31:43] perhaps we might also drop the prometheus- prefix from the name of the prometheus-based dashboard once the statsd-based one is gone?
[13:31:53] indeed
[13:31:59] I did that with the TLS one I think
[13:32:06] or at least I thought about it
[13:33:19] done: https://grafana-admin.wikimedia.org/dashboard/db/varnish-caching
[13:33:50] \o/
[13:34:22] I kept the prometheus tag though
[13:34:36] yeah that makes sense
[13:35:41] one day we'll migrate from Prometheus to Heracles and that info will be very useful :)
[13:35:59] *sigh*
[13:36:12] ema: yeah eqsin traffic is higher than expected for sure :) it beats average rates at ulsfo and codfw handily now. it's got about 75% of the avg reqs of eqiad, or about 33% of the avg reqs of esams.
[13:41:38] bblack: nice! While still having a great hitrate
[13:41:52] 98.3% right now
[13:42:22] sigh.. puppet is not happy with my varnishxcache change
[13:42:33] pcc was happy though, wtf
[13:42:45] that happens :)
[13:43:02] Error: /Stage[main]/Varnish::Logging/Varnish::Logging::Xcache[xcache]/File[/usr/local/bin/varnishxcache]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/varnish/varnishxcache
[13:45:14] I reran puppet agent on cp2016 and it worked... bad timing?
[13:46:19] same with cp3042
[13:46:47] yeah, I've seen this happening in the past when removing a file.
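The varnishxcache failure above ("Could not retrieve information from environment production source(s)") cleared on a simple re-run. A minimal sketch of the retry-on-transient-error idea; the `run_agent` callable is a stand-in for invoking `puppet agent -t`, and nothing here is real WMF tooling:

```python
# Illustrative sketch only: re-run a puppet agent when its output shows the
# transient "Could not retrieve information ..." error seen above after a
# file source was removed from the repo mid-run.

TRANSIENT = "Could not retrieve information from environment production source(s)"

def run_with_retry(run_agent, attempts=2):
    """Run the agent, retrying if the known transient error appears."""
    for _ in range(attempts):
        output = run_agent()
        if TRANSIENT not in output:
            return output
    return output

# Simulated agent: fails transiently once, then succeeds.
runs = iter([f"Error: ... {TRANSIENT} puppet:///modules/varnish/varnishxcache",
             "Notice: Applied catalog in 12.34 seconds"])
print(run_with_retry(lambda: next(runs)))  # -> Notice: Applied catalog in 12.34 seconds
```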
Just run puppet again where it fails, I guess
[13:48:41] I have to go afk for a bit, see you * later
[13:58:44] it went smoothly on lvs5003 :D
[13:58:46] vgutierrez@neodymium:~$ sudo cumin 'R:profile::lvs::interface_tweaks'
[13:58:46] 1 hosts will be targeted:
[13:58:47] lvs5003.eqsin.wmnet
[13:58:49] <3
[13:59:36] vgutierrez: pro-tip P:lvs::interface_tweaks ;)
[13:59:55] O: == role:: P: == profile::
[13:59:57] volans: hahahaha I love your cumin trigger
[14:02:39] ;)
[14:04:44] it ran smoothly on every DC
[14:04:46] lvs[2004,2006].codfw.wmnet,lvs5003.eqsin.wmnet,lvs[3003-3004].esams.wmnet,lvs4007.ulsfo.wmnet,lvs1006.wikimedia.org
[14:11:49] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4120303 (Tgr)
[14:23:40] I reenabled puppet on the primary LVSs after checking that every secondary was behaving as expected, and the primaries on eqsin as well
[14:23:58] I didn't trigger an icinga XMAS tree this time O:)
[14:32:48] Traffic, Operations, Pybal: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (Vgutierrez) p: Triage→Normal
[14:45:39] https://gerrit.wikimedia.org/r/#/c/425278/ --> this should be enough to reimage lvs5003 as stretch and handle "predictable" network interface names
[14:51:08] vgutierrez: I think it's hieradata: profile::pybal::bgp: "no" . and then for deploy, should probably stop pybal and puppet agent on lvs5003, then merge the change (and puppet agent the dhcp server for the stretch switch, install1002 I think?), then reboot for reinstall
[14:51:28] (so that nothing tries to puppetize any of this change live on lvs5003 before reinstall)
[14:52:17] yup.. I'm amending the commit right now
[14:56:17] and regarding disabling puppet before merging, I completely agree
[14:58:56] +1
[14:59:13] I'm stepping into the Meeting Zone for a while, I may be somewhat unresponsive!
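Volans's pro-tip above is cumin's query shorthand: "O: == role:: P: == profile::". A toy Python sketch of that prefix expansion, rewriting the short form into the `R:profile::...` style used in the session earlier; the exact alias grammar cumin implements is an assumption here, not taken from the log:

```python
# Hypothetical sketch of cumin-style query shorthand expansion.
# Mapping is assumed from "O: == role::  P: == profile::" in the chat;
# this is not cumin's actual implementation.

ALIASES = {"O:": "R:role::", "P:": "R:profile::"}

def expand(query):
    """Expand an O:/P: shorthand into its full R:role::/R:profile:: form."""
    for short, full in ALIASES.items():
        if query.startswith(short):
            return full + query[len(short):]
    return query  # already in long form, or no known prefix

print(expand("P:lvs::interface_tweaks"))  # -> R:profile::lvs::interface_tweaks
```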
[14:59:23] ack :)
[15:53:53] vgutierrez: wanna merge https://gerrit.wikimedia.org/r/#/c/424611/ too now?
[15:55:06] yup, I was trying to solve the mgmt issue first
[16:01:16] vgutierrez: mgmt issue?
[16:01:40] ema: 623/UDP is not reachable on *.mgmt.eqsin.wmnet from the cumin masters
[16:02:06] so wmf-auto-reimage-host is failing
[16:02:08] volans: thx :*
[16:02:26] ok, we need an icinga alert for that
[16:02:50] we have checks for what was deemed not to be destructive
[16:03:09] the management cards are (were?) known to fail if pinged too many times
[16:03:17] grr
[16:03:48] lol
[16:04:03] so on icinga we check that ssh is responding, the dns is set, and we can ping, IIRC
[16:04:18] we don't do an ssh login or a real remote IPMI check
[16:04:26] *but* the reimage script uses that
[16:04:37] so it should ensure that it works at installation time
[16:05:29] more context on T169321
[16:05:30] T169321: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321
[16:07:09] vgutierrez: as for things to check if remote IPMI is not working and the firewall is not the issue, see T150160 and related
[16:07:10] T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160
[16:39:20] bblack: about my new wdqs-internal service (https://gerrit.wikimedia.org/r/#/c/424587/ & https://gerrit.wikimedia.org/r/#/c/424599/) was there anything else to correct?
[16:42:47] hrmm
[16:42:57] ok cp2022 is back online, but many icinga checks are yellow or red
[16:43:14] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp2022
[16:50:14] heh, 'Varnishkafka Delivery Errors per minute' is green since no service means no errors.
[16:53:59] anyone who knows cp systems wanna assist with the resurrection of cp2022?
[16:54:14] (its ocsp staple checks are critical when puppet has run on the host)
[16:57:14] Traffic, Operations, ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4120807 (RobH) Ok, so I just took this over from Papaul. He replaced the bad memory on the A side earlier today, but after just clearing the log and rebooting, we have more memory errors: Record:...
[17:22:00] https://phabricator.wikimedia.org/T191905
[17:22:19] ema: that's the issue regarding ipmi on eqsin
[17:22:30] volans: thx for helping with the debugging :*
[17:22:41] XioNoX: and thanks to you too
[17:23:06] I'll get it fixed tomorrow morning and after that I'll resume the lvs5003 reimage
[17:23:13] vgutierrez: no prob, it's easy to fix, just change --diff with --commit
[17:23:23] and then redo the --diff to check that it is empty
[17:23:56] cool
[17:55:09] Wikimedia-Apache-configuration, Operations, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063273 (10Dzahn) I investigated a bit on the part ".. on mwdebug1001 and mwdebug1002, .. behaves differently on...
[18:02:57] Wikimedia-Apache-configuration, Operations, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4121053 (Dzahn) The version of apache2.conf that canaries and mwdebug have matches the puppet repo template: me...
[18:15:50] Traffic, Operations, ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121087 (Papaul) @BBlack we replaced the main board on cp2022 and the new NIC MAC address is: 44:A8:42:2D:1E:80. I asked the Dell tech to leave the memory for the other 3 servers cp2008, cp2011 and cp20...
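The reimage failure discussed above comes down to UDP port 623 (the RMCP/IPMI port) being blocked toward *.mgmt.eqsin.wmnet. A hedged sketch of the kind of remote-IPMI probe involved; the exact command wmf-auto-reimage runs is not shown in the log, this is just a generic `ipmitool -I lanplus` invocation:

```python
# Hedged sketch: construct the kind of remote IPMI probe that fails when
# 623/UDP is blocked, as in the eqsin issue above. Not wmf-auto-reimage's
# actual code; -E reads the password from the IPMI_PASSWORD env variable.

IPMI_PORT = 623  # RMCP/IPMI runs over UDP port 623

def ipmi_power_status_cmd(mgmt_host, user="root"):
    """Return an argv list probing chassis power state over lanplus."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host,
            "-U", user, "-E", "chassis", "power", "status"]

print(ipmi_power_status_cmd("lvs5003.mgmt.eqsin.wmnet"))
```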
[18:18:54] Traffic, Operations, ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121099 (Papaul) Note: there is no need to reimage the server because the MAC address is the same, 44:A8:42:2D:1E:80
[18:38:13] Traffic, Operations, ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4121210 (BBlack)
[18:38:20] Traffic, Operations, ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121208 (BBlack) Open→Resolved all green in icinga now and repooled, closing!
[18:47:29] Traffic, Operations, ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121254 (Vgutierrez)
[19:02:32] Traffic, Operations, ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121313 (Volans) Reporting it here too for the future: to fix, it's sufficient to replace the `--diff` of the above command with `--commit` and then re-run the `--diff` to ensure that this t...
[19:07:14] Traffic, Operations, ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121328 (Vgutierrez) p: Triage→Normal
[21:47:05] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4121741 (DFoy) @BBlack - not sure why OperaMini proxy IPs are no longer being exported. Can this information be re-established? My only cha...
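The fix volans describes in T191905 above is to re-run the same (elided) network-config command with `--commit` in place of `--diff`, then run the `--diff` form again to confirm it is empty. A trivial sketch of that argv rewrite; the command name `some-network-tool` is a placeholder, since the real command is not shown in the log:

```python
# Sketch of the --diff -> --commit swap described above. The tool name and
# arguments are placeholders; only the flag substitution is the point.

def with_commit(argv):
    """Return a copy of argv with --diff swapped for --commit."""
    return ["--commit" if a == "--diff" else a for a in argv]

diff_cmd = ["some-network-tool", "--diff", "eqsin-filters"]
print(with_commit(diff_cmd))  # -> ['some-network-tool', '--commit', 'eqsin-filters']
```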
[23:14:06] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (Krinkle)
[23:14:09] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121980 (Krinkle)
[23:17:13] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (Paladox) I guess this is why en.wikipedia.org and phabricator.wikimedia.org would not load for me? (though gerrit.wikimedia.org loaded for me)
[23:18:05] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (Ghouston) They still don't load for me. I think this is about April 10, not April 11.
[23:18:42] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121999 (Ghouston) Well, phabricator is fine.
[23:19:55] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122000 (Krinkle)
[23:21:39] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122003 (Ghouston) Just started working again.
[23:28:32] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122016 (Ghouston) And now dead again. Affects www.wikipedia.org, commons, wikidata, wiktionary.
[23:43:05] Traffic, Operations, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122038 (Krinkle)
[23:44:20] Traffic, Operations, Performance-Team, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (Krinkle)
[23:50:36] Traffic, Operations, Performance-Team, Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (ayounsi) This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad. I applied the change to cr1-...