[06:15:10] <ema>	 mutante: nice catch (re: pointless prometheus yaml file updates). I think https://gerrit.wikimedia.org/r/#/c/425218/ should do the trick
[06:16:13] <wikibugs>	 10Traffic, 10Operations, 10Pybal, 10Patch-For-Review: Add UDP monitor for pybal - https://phabricator.wikimedia.org/T178151#4119325 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez
[07:03:01] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119368 (10Marostegui) This is the list of slaves per section we'd need to depool before starting this maintenance:  s1:  db1089 main db1105 rc  s2: db1060 vslow db1090 main db1105...
[07:37:26] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119423 (10jcrespo) I would honestly move x1 replica (or the master directy), probably in a logical way, somewhere else- we don't want to serve the whole service from the same row,...
[07:40:17] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119429 (10Marostegui) >>! In T187962#4119423, @jcrespo wrote: > I would honestly move x1 replica (or the master directy), probably in a logical way, somewhere else- we don't want t...
[07:43:08] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119434 (10jcrespo) I would do the second.
[07:43:40] <wikibugs>	 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4119435 (10jcrespo)
[13:02:46] <vgutierrez>	 bblack: please let us know what are your thoughts regarding https://gerrit.wikimedia.org/r/#/c/425040/
[13:23:27] <bblack>	 vgutierrez: +1 :)
[13:24:51] <vgutierrez>	 bblack: as I commented with ema, I'll disable puppet on the primary LVSs, and I'll check how it behaves on the secondary ones
[13:25:04] <vgutierrez>	 just to be safe
[13:26:03] <vgutierrez>	 BTW, https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-http-requests?orgId=1 looks saner than the statsd equivalent: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?orgId=1
[13:26:16] <vgutierrez>	 and you can see (on the prometheus one) how eqsin killed ulsfo :)
[13:26:17] <ema>	 vgutierrez, gehel: let's keep an eye on lvs1006 (now w/ UDP monitoring enabled) for the next hours and then if nothing burns restart the other LVSs to pick up the config changes
[13:26:28] <vgutierrez>	 ema: ack :)
[13:27:51] <vgutierrez>	 ema: BTW, I'm moving forward with https://gerrit.wikimedia.org/r/#/c/421925/
[13:27:58] <vgutierrez>	 it already smells like a rotten CR
[13:28:01] <ema>	 "eqsin killed the ulsfo" must be the official traffic team karaoke song
[13:28:47] <vgutierrez>	 // !log beers on eqsin killing ulsfo O:)
[13:29:03] <ema>	 bblack: ok to get rid of varnishxcache? https://gerrit.wikimedia.org/r/#/c/421925/
[13:29:20] <vgutierrez>	 he already gave me a +1 on IRC a week ago :)
[13:29:34] <ema>	 ok then! 
[13:30:02] <bblack>	 yeah
[13:30:37] <vgutierrez>	 so I'm deleting the old dashboard before getting that merged
[13:31:43] <ema>	 perhaps we might also drop the prometheus- prefix from the name of the prometheus-based dashboard once the statsd-based one is gone?
[13:31:53] <vgutierrez>	 indeed
[13:31:59] <vgutierrez>	 I did that with the TLS one I think
[13:32:06] <vgutierrez>	 or at least I thought about it
[13:33:19] <vgutierrez>	 done: https://grafana-admin.wikimedia.org/dashboard/db/varnish-caching
[13:33:50] <ema>	 \o/
[13:34:22] <vgutierrez>	 I kept the prometheus tag though 
[13:34:36] <ema>	 yeah that makes sense
[13:35:41] <ema>	 one day we'll migrate from Prometheus to Heracles and that info will be very useful :)
[13:35:59] <vgutierrez>	 *sigh*
[13:36:12] <bblack>	 ema: yeah eqsin traffic is higher than expected for sure :) it beats average rates at ulsfo and codfw handily now.  it's got about 75% the avg reqs of eqiad, or about 33% of the avg reqs of esams.
[13:41:38] <ema>	 bblack: nice! While still having a great hitrate
[13:41:52] <ema>	 98.3% right now
[13:42:22] <vgutierrez>	 sigh.. puppet is not happy with my varnishxcache change
[13:42:33] <vgutierrez>	 pcc was happy though, wtf
[13:42:45] <ema>	 that happens :)
[13:43:02] <ema>	 Error: /Stage[main]/Varnish::Logging/Varnish::Logging::Xcache[xcache]/File[/usr/local/bin/varnishxcache]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/varnish/varnishxcache
[13:45:14] <vgutierrez>	 I reran puppet agent on cp2016 and worked... bad timing?
[13:46:19] <vgutierrez>	 same with cp3042
[13:46:47] <ema>	 yeah, I've seen this happening in the past when removing a file. Just run puppet again where it fails I guess
[13:48:41] <ema>	 I have to go afk for a bit, see you * later
[13:58:44] <vgutierrez>	 it went smoothly on lvs5003 :D
[13:58:46] <vgutierrez>	 vgutierrez@neodymium:~$ sudo cumin 'R:profile::lvs::interface_tweaks'
[13:58:46] <vgutierrez>	 1 hosts will be targeted:
[13:58:47] <vgutierrez>	 lvs5003.eqsin.wmnet
[13:58:49] <vgutierrez>	 <3
[13:59:36] <volans>	 vgutierrez: pro-tip P:lvs::interface_tweaks ;)
[13:59:55] <volans>	 O: == role:: P: == profile::
[13:59:57] <vgutierrez>	 volans: hahahaha I love your cumin trigger
[14:02:39] <volans>	 ;)
[14:04:44] <vgutierrez>	 it ran smoothly on very DC
[14:04:46] <vgutierrez>	 lvs[2004,2006].codfw.wmnet,lvs5003.eqsin.wmnet,lvs[3003-3004].esams.wmnet,lvs4007.ulsfo.wmnet,lvs1006.wikimedia.org
[14:04:49] <vgutierrez>	 *every
[14:11:49] <wikibugs>	 10Traffic, 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4120303 (10Tgr)
[14:23:40] <vgutierrez>	 I reenabled puppet on the primary LVSs after checking that every secondary was behaving as expected and primaries on eqsin as well
[14:23:58] <vgutierrez>	 I didn't trigger an icinga XMAS tree this time O:)
[14:32:48] <wikibugs>	 10Traffic, 10Operations, 10Pybal: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4120349 (10Vgutierrez) p:05Triage>03Normal
[14:45:39] <vgutierrez>	 https://gerrit.wikimedia.org/r/#/c/425278/ --> this should be enough to reimage lvs5003 as stretch and handle "predictable" network interface names
[14:51:08] <bblack>	 vgutierrez: I think it's heiradata: profile::pybal::bgp: "no" .  and then for deploy, should probably stop pybal and puppet agent on lvs5003, then merge up change (and puppet agent the dhcp server for the stretch switch, install1002 I think?), then reboot for reinstall
[14:51:28] <bblack>	 (so that nothing tries to puppetize any of this change live on lvs5003 before reinstall)
[14:52:17] <vgutierrez>	 yup.. I'm amending the commit right now
[14:56:17] <vgutierrez>	 and regarding disabling puppet before merging, I completely agree
[14:58:56] <bblack>	 +1
[14:59:13] <bblack>	 I'm stepping into the Meeting Zone for a while, I may be somewhat unresponsive!
[14:59:23] <vgutierrez>	 ack :)
[15:53:53] <ema>	 vgutierrez: wanna merge https://gerrit.wikimedia.org/r/#/c/424611/ too now?
[15:55:06] <vgutierrez>	 yup, I was trying to solve the mgmt issue first
[16:01:16] <ema>	 vgutierrez: mgmt issue?
[16:01:40] <vgutierrez>	 ema: 623/UDP is not reachable from *.mgmt.eqsin.wmnet from cumin masters
[16:01:57] <volans>	 s/from/on/
[16:02:06] <vgutierrez>	 so wmf-auto-reimage-host is failing
[16:02:08] <vgutierrez>	 volans: thx :*
[16:02:26] <ema>	 ok, we need an icinga alert for that 
[16:02:50] <volans>	 we have checks for what was deemed not to be destructive
[16:03:09] <volans>	 the management card are (were?) known to fail if pinged too many times
[16:03:17] <ema>	 grr
[16:03:48] <vgutierrez>	 lol
[16:04:03] <volans>	 so we check that ssh is responding on icinga and the dns is set and we can ping IIRC
[16:04:18] <volans>	 we don't do ssh login or real remote IPMI check
[16:04:26] <volans>	 *but* the reimage script uses that
[16:04:37] <volans>	 so it should ensure that it works at installation time
[16:05:29] <volans>	 more context on T169321
[16:05:30] <stashbot>	 T169321: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321
[16:07:09] <volans>	 vgutierrez: as of things to check if remote IPMI is not working and the firewall is not the issue see T150160 and related
[16:07:10] <stashbot>	 T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160
[16:39:20] <gehel>	 bblack: about my new wdqs-internal service (https://gerrit.wikimedia.org/r/#/c/424587/ & https://gerrit.wikimedia.org/r/#/c/424599/ ) was there anything else to correct?
[16:42:47] <robh>	 hrmm
[16:42:57] <robh>	 ok cp2022 is back online, but many icinga checks are yellow or red
[16:43:14] <robh>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=cp2022
[16:50:14] <robh>	 heh, 'Varnishkafka Delivery Errors per minute' is green since no serivce means no errors.
[16:53:59] <robh>	 anyone know cp systems wanna assist on resurrection of cp2022?
[16:54:14] <robh>	 (its oscp staple checks are critical when puppet has run on the host)
[16:57:14] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4120807 (10RobH) Ok, So I just took this over from Papaul.  He replaced the bad memory on the A side earlier today, but after just clearing the log and rebooting, we have more memory errors:  Record:...
[17:22:00] <vgutierrez>	 https://phabricator.wikimedia.org/T191905
[17:22:19] <vgutierrez>	 ema: that's the issue regarding ipmi on eqsin
[17:22:30] <vgutierrez>	 volans: thx for helping with the debugging :*
[17:22:41] <vgutierrez>	 XioNoX: and thanks to you too
[17:23:06] <vgutierrez>	 I'll get it fixed tomorrow morning and after that I'll resume lvs5003 reimage
[17:23:13] <volans>	 vgutierrez: no prob, it's wasy to fix, just change --diff with --commit
[17:23:23] <volans>	 and then redo the --diff to check that is empty
[17:23:38] <volans>	 *easy
[17:23:56] <vgutierrez>	 cool
[17:55:09] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4063273 (10Dzahn) I investigated a bit on the part ".. on mwdebug1001 and mwdebug1002, .. behaves differently on...
[18:02:57] <wikibugs>	 10Wikimedia-Apache-configuration, 10Operations, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4121053 (10Dzahn) The version of apache2.conf that canaries and mwdebug has matches the puppet repo template:  me...
[18:15:50] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121087 (10Papaul) @BBlack  we replaced the main board on cp2022 and the  new NIC MAC address is:44:A8:42:2D:1E:80  I asked Dell tech to leave the memory for the other 3 servers cp2008, cp2011 and cp20...
[18:18:54] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121099 (10Papaul) Note: there is no need to re image the server because the MAC address is the same 44:A8:42:2D:1E:80;
[18:38:13] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4121210 (10BBlack)
[18:38:20] <wikibugs>	 10Traffic, 10Operations, 10ops-codfw: cp2022 memory replacement - https://phabricator.wikimedia.org/T191229#4121208 (10BBlack) 05Open>03Resolved all green in icinga now and repooled, closing!
[18:47:29] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121254 (10Vgutierrez)
[19:02:32] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121313 (10Volans) Reporting it here too for the future, to fix it's sufficient to replace the `--diff` of the above command with `--commit` and then re-run the `--diff` to ensure that this t...
[19:07:14] <wikibugs>	 10Traffic, 10Operations, 10ops-eqsin: eqsin hosts don't allow remote ipmi - https://phabricator.wikimedia.org/T191905#4121328 (10Vgutierrez) p:05Triage>03Normal
[21:47:05] <wikibugs>	 10Traffic, 10Analytics, 10Analytics-Data-Quality, 10Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4121741 (10DFoy) @BBlack - not sure why OperaMini proxy IPs are no longer being exported.  Can this information be re-established?   My only cha...
[23:14:06] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Krinkle)
[23:14:09] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121980 (10Krinkle)
[23:17:13] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Paladox) I guess this is why en.wikipedia.org and phabricator.wikimedia.org would not load for me? (though gerrit.wikimedia.org loaded for me)
[23:18:05] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Ghouston) They still don't load for me. I think this is about April 10, not April 11.
[23:18:42] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-11 global traffic drop - https://phabricator.wikimedia.org/T191940#4121999 (10Ghouston) Well, phabricator is fine.
[23:19:55] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122000 (10Krinkle)
[23:21:39] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122003 (10Ghouston) Just started working again.
[23:28:32] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122016 (10Ghouston) And now dead again. Affects www.wikipedia.org, commons, wikidata,  wiktionary.
[23:43:05] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4122038 (10Krinkle)
[23:44:20] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10Krinkle)
[23:50:36] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-Incident: Investigate 2018-04-10 global traffic drop - https://phabricator.wikimedia.org/T191940#4121970 (10ayounsi) This was caused by a change made for T191667, more specifically enabling nonstop-routing on cr1/2-eqiad. I applied the change to cr1-...