[00:20:15] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933087 (10Krinkle)
[00:29:14] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933117 (10Krinkle)
[00:31:20] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933120 (10Krinkle)
[00:33:44] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933134 (10Krinkle)
[00:36:22] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#1810236 (10Krinkle)
[00:37:55] <wikibugs>	 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933165 (10Krinkle)
[00:51:15] <wikibugs>	 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933246 (10mobrovac)
[02:36:08] <wikibugs>	 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933362 (10Krinkle)
[03:12:20] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3933395 (10Krinkle)
[03:13:09] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10Krinkle)
[06:22:26] <elukey>	 hello people, just restarted the varnish backend on cp4024 since it was causing 503s 
[06:30:39] <mutante>	 oh, did you. thank you. i was looking but wasnt sure if i should do just that. next time i will. i saw it was limited to ulsfo upload but not that cp4024 specifically was the root source
[06:32:57] <mutante>	 thanks and recovery confirmed. out again
[15:58:42] <robh>	 bblack: how soon before we go onsite to replace ulsfo switches do we need to depool the site?
[15:58:50] <robh>	 im guessing over an hour due to TTL
[15:58:52] <robh>	 ?
[16:08:28] <bblack>	 robh: yeah, preferably.  Remind me again the start time and rough guestimate of the network downtime?
[16:09:20] <robh>	 im driving down at 930, but dont expect to get there until 10
[16:09:31] <robh>	 then XioNoX and i estimate 2-3 hours if things go well.
[16:09:44] <robh>	 or all day if things go terribly, but its likely a minimum of 3 hours.
[16:09:50] <robh>	 https://gerrit.wikimedia.org/r/#/c/407022/
[16:14:31] <bblack>	 robh: yeah I'd push it now, and !log it too
[16:15:15] <robh>	 so ive done this before but can someone else santiy check +1 my patch?
[16:15:32] <bblack>	 done!
[16:15:33] <robh>	 thx =]
[16:16:32] <robh>	 ok, pushed and ulsfo is now geodns depooled.
[16:16:48] <robh>	 and of course we'll check inbefore we start yanking shit out of rack ;D
[16:17:05] <robh>	 XioNoX: Get that bike ready! ;D
[16:17:26] <XioNoX>	 I'll be faster than you in traffic :)
[16:21:26] <bblack>	 lol
[16:35:49] <robh>	 by quite a bit yeah
[16:36:05] <robh>	 pretty sure your biking there is largely unaffected by street traffic ;]
[17:43:58] <wikibugs>	 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3934617 (10BBlack) TL;DR: The network itself doesn't seem to be at fault.  Whatever this is, it probably affects esams more than othe...
[18:40:33] <robh>	 bblack: so im about to unplug ulsfo ssystems
[18:40:44] <robh>	 should i shut them down or anything to help them recover or just leave them online?
[18:40:50] <robh>	 other than manually depool that is
[18:41:06] <robh>	 they're going to remain powered up and online in OS unless you specify otherwise =]
[18:43:20] <robh>	 I assume i need to depool them one by one
[18:47:45] <wikibugs>	 10Traffic, 10netops, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3934835 (10RobH)
[18:47:53] <robh>	 Well, thats what I did so if I was wrong let me know!  
[18:48:13] <robh>	 since they are etting unplugged seemed safer to set to manual depool so no auto repooling happens...
[18:50:45] <robh>	 see -operations i just caused pager storm fora depooled site
[18:54:12] <robh>	 bblack: Ok, so now they are all back to manually pooled
[18:54:18] <robh>	 however im about to start pulling their network connections
[18:54:33] <bblack>	 so, before more pager storms ensue
[18:54:39] <robh>	 So what exactly should we maint in icinga?
[18:54:50] <robh>	 iver already set all the cp systems themselves and all services to maint.
[18:54:54] <bblack>	 now that the pooling state is correct, just make sure icinga downtimes are set for: all ulsfo hosts, and all ulsfo LVS checks
[18:55:29] <robh>	 ok, ill add lvs checks into maint now
[18:56:23] <bblack>	 robh: "since they are getting unplugged"? from power?
[18:57:06] <bblack>	 hmm no, says "remain powered up" a few lines above that, I guess you mean from the switch.
[18:57:39] <bblack>	 either way, the confctl per-host depooling is for isolated cases.  We never mass-depool a whole site's cp servers out of confctl.  It doesn't work and causes unecessary pain.
[18:58:08] <robh>	 just unplugged from switch
[18:58:12] <bblack>	 ok
[18:58:50] <robh>	 which im about to start doing now if thats ok?  i set maint mode for the text and upload.ulsfo.wikimedia.org
[18:59:01] <robh>	 which i missed before =P
[18:59:51] * robh is standing by that its ok
[19:01:38] <bblack>	 the LVS checks look ok.  You probably want to downtime the necessary network devices, too.
[19:02:09] <bblack>	 asw-ulsfo at least I imagine, but not the routers or oob?  maybe also ripe-atlas
[19:03:09] <robh>	 true, will do now
[19:03:10] <bblack>	 also, we could avoid ipsec spam by downtiming all of those for basically-everything except esams ipsec hosts
[19:03:25] <robh>	 you wanna downtime the ipsec stuff then?
[19:03:30] <bblack>	 yeah I'll poke at it
[19:04:17] <bblack>	 it's a PITA, as there's no easy regex search for what to downtime for ipsec heh
[19:05:21] <robh>	 im downtiming the access switches, oob and atlas
[19:06:45] <bblack>	 ipsec downtimes done
[19:06:46] <robh>	 ok done
[19:07:27] <bblack>	 is mr1 affected in general? or just mr1.oob?
[19:08:29] <bblack>	 either way, I think that's the only one I see where I don't know for sure
[19:08:44] <robh>	 ill just set it all
[19:09:15] <robh>	 i set oob and its the other one technically yeah
[19:10:21] <bblack>	 ok
[19:11:14] <bblack>	 I downtimed the ulsfo-specific 5xx rate checker too, although honestly if that goes off, probably one of the broader all-sites ones will go off pointlessly as well, which we can't disable.  the checks just aren't ideal.
[19:11:24] <bblack>	 robh: lgtm
[19:12:08] <bblack>	 robh: apparently not all ulsfo hosts downtimed :)
[19:12:11] <XioNoX>	 I'm monitoring this channel.If something goes wrong let me know
[19:12:35] <robh>	 well shit i forgot lvs
[19:12:38] <robh>	 sorry, still sick =P
[19:12:46] <robh>	 at least those hosts dont page
[19:13:15] <bblack>	 and dns.  downtiming all those
[19:14:18] <bblack>	 anyways, keep going
[19:15:24] <bblack>	 TODO: maybe there should be some kind of virtual whole-datacenter object in icinga that can be downtimed to suppress everything in that DC, and all the other things depend on it (just to be used for scenarios like these)
[20:29:40] <wikibugs>	 10Traffic, 10Operations: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968#3929788 (10BBlack) In both cases the child was killed with signal 9 by the kernel oom-killer.  It may be the case that our memory cache sizing is very tight in general, and that overheads have increased...
[20:33:33] <robh>	 fyi: on ulsfo switch swap.  new switches are racked and all wired up and arzhel is working on config stuff now
[20:38:31] <XioNoX>	 config is all done
[20:38:46] <XioNoX>	 working on dns monitoring now
[20:42:38] <bblack>	 dns monitoring?
[20:42:49] <bblack>	 oh, I see now, ignore that question!
[20:43:25] <bblack>	 I'll deal with invalidation/repool stuff once everything's up and going.  You can leave it running-but-dns-depooled at that point.
[20:46:12] <bblack>	 (my current thinking on the invalidation stuff is probably to repool the site with cache contents as-is, and then do some rolling cache wipes (via daemon restarts) over a reasonable timeframe to get past the missed-invalidation problems)
[21:06:21] <wikibugs>	 10Traffic, 10Operations, 10Wikimedia-Site-requests: oudated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3935268 (10bd808)
[21:19:16] <robh>	 removing scheduled downtimes in icinga for the hosts in ulsfo
[21:19:24] <robh>	 lvs cp and mr1 removed
[21:19:32] <robh>	 not touching asw since itll be a new asw hostname
[21:20:20] <XioNoX>	 Can someone check if icinga is working fine? I renamed asw-ulsfo to asw2-ulsfo, and the first puppet run showed an issue when reloading puppet, 2nd one was all fine
[21:21:13] <mutante>	 Total Errors:   1
[21:21:23] <mutante>	 Error: 'asw-ulsfo' is not a valid parent for host 'cp4021' (
[21:21:41] <mutante>	 puppet will not restart it when the config check fails 
[21:21:50] <mutante>	 so it's not broken but it would be on restart
[21:22:38] <XioNoX>	 mutante: the run on einsteinium showed:
[21:22:38] <XioNoX>	 -	parents                        asw-ulsfo
[21:22:38] <XioNoX>	 +	parents                        asw2-ulsfo
[21:23:01] <XioNoX>	 so it's doing the rename at some point
[21:23:45] <wikibugs>	 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3935345 (10RobH)
[21:24:18] <mutante>	 XioNoX: i'll try to fix it and see if puppet reverts 
[21:24:38] <XioNoX>	 mutante: that's the change I made: https://gerrit.wikimedia.org/r/#/c/407059/1/modules/netops/manifests/monitoring.pp
[21:25:33] <mutante>	 i think maybe puppet has to run on all the cp hosts.. and then on einsteinium too
[21:25:37] <mutante>	 checking
[21:25:59] <mutante>	 it did it for some hosts but not for cp4021.. running puppet on both now
[21:26:26] <mutante>	 -	parents                        asw2-ulsfo
[21:26:27] <mutante>	 +	parents                        asw-ulsfo
[21:26:34] <mutante>	 it reverts me actively...
[21:28:42] <robh>	 \o/
[21:30:47] <mutante>	 that wasn't a good thing so far:)  but now it did the opposite after i ran puppet on cp4021 and einsteinium again
[21:32:32] <mutante>	 XioNoX: fixed now.  Total Errors:   0
[21:32:45] <mutante>	 and it didn't re-break it on next run so far
[21:33:17] <robh>	 cool
[21:33:22] <XioNoX>	 mutante: so what's the proper order?
[21:33:40] <mutante>	 XioNoX: puppet run on all cp* hosts and then on einsteinium
[21:33:40] <robh>	 host server then einstiem to clear seems like?
[21:33:43] <robh>	 cool
[21:33:48] <XioNoX>	 ok, thanks
[21:33:56] <mutante>	 and the check to get more info:
[21:33:56] <mutante>	 [einsteinium:~] $ sudo icinga -v /etc/icinga/icinga.cfg 
[21:34:03] <robh>	 anyone running on 4024 yet?
[21:34:31] <robh>	 i am.
[21:35:50] <robh>	 cp4024 is just sitting on loading facts =P
[21:36:13] <mutante>	 that's also the one that needed to be kicked yesterday, heh
[21:36:15] <robh>	 its remotely accessible so its not an onsite issue, even thgouh its caused yb onsite work...
[21:36:20] <robh>	 fuck typos
[21:36:55] <robh>	 mutante: kicked as in reboot?
[21:37:45] <mutante>	 robh: as in "restart varnish backend" it caused 5xx before and then it was fixed after elukey did that
[21:37:45] <robh>	 no hardare failure events in sel
[21:37:56] <robh>	 oh, well its still sitting on laoding facts
[21:37:59] <robh>	 so something is fucked up.
[21:38:17] <robh>	 lets restart it and try again...
[21:38:24] <robh>	 restart the puppet run that is.
[21:38:37] <robh>	 bleh, so far same issue.
[21:42:05] <robh>	 Not sure whats up with it.
[21:42:41] <bblack>	 so
[21:43:05] <bblack>	 scrolling back re: icinga, I'm pretty sure the magical dependencies on e.g. asw-ulsfo from various $hosts comes from their lldpd-based $facts
[21:43:33] <bblack>	 so they should fix themselves after a cycle of: all the ulsfo hosts running puppet agent, running icinga again on the icinga master(s)
[21:43:39] <robh>	 So the only host that seems in a bad state is cp4024
[21:43:45] <robh>	 which is stuck on loading facts in puppet each attempted run
[21:43:49] <bblack>	 checking
[21:43:53] <robh>	 ill cancel my run out 
[21:44:00] <robh>	 so you can try it and see if you spot something i miss
[21:44:04] <robh>	 --verbose tells nothing ;]
[21:44:12] <paladox>	 --debug
[21:44:16] <robh>	 ok, killed my run
[21:44:34] <robh>	 bblack: i assume you are checking it or should i try with debug?
[21:44:39] <bblack>	 yeah I'm checking
[21:45:02] <bblack>	 wow, puppet agent does some amusingly pointless and inefficient crap when observing startup via strace :)
[21:49:10] <robh>	 https://phabricator.wikimedia.org/T185228 is assigned to you now brandon i think we're at the point for traffic handoff (other than cp4024)
[21:49:30] <robh>	 let me know if we need to stick around, otherwise id like to head out to beat rush hour =]
[21:49:38] <bblack>	 so, nothing's really broken at the software level on cp4024 I don't think, per se.  It's just taking a very long time to communicate with the puppemaster...
[21:50:13] <bblack>	 and I'm observing an error rate of ~10% on eth0
[21:50:37] <bblack>	 (as in, on cp4024's eth0, RX errors/packets in  RX packets:228926940811 errors:885462 dropped:827743 overruns:0 frame:885462
[21:50:40] <bblack>	 )
[21:51:03] <bblack>	 if you just look at the increase now.  most of the RX packets are from before the downtime.
[21:51:13] <bblack>	 so probably we have a bad connection to the new switch there
[21:52:31] <bblack>	 I observe similar ballpark-10% loss rates with ping over time as well (cp4024->bast4002)
[21:52:40] <bblack>	 33/30 packets, 9% loss, min/avg/ewma/max = 0.071/0.143/0.133/0.210 ms
[21:52:44] <robh>	 so we need to swap the fiber?
[21:52:54] <robh>	 seems liek first step for bad packet loss....
[21:52:57] <bblack>	 I have no idea what's actually-wrong, I just know I'm observing network errors
[21:53:06] <robh>	 more likely a cracked fiber optic than anything port specific to the new switch imo
[21:54:08] <bblack>	 what's sad is a lot of other stuff works fairly well with 10% loss, but puppet is so inefficient and retarded that 10% may as well be 100% :P
[21:55:09] <robh>	 arzhel is comparing network stack diagnostics between ports.
[21:55:17] <robh>	 ideally we can see something there so we go swap cable and see
[21:55:25] <bblack>	 (also, you wouldn't believe the number of times a puppet agent run ends up doing a for-loop over all the 16-bit integers and calling the close() syscall on them all :P)
[21:55:40] <XioNoX>	 light level are similar to other interfaces, and no errors on the switch port
[21:55:52] <robh>	 bblack: are you seeing that loss via os?
[21:56:03] <bblack>	 yes
[21:56:07] <bblack>	 cp4025: 138/138 packets, 0% loss, min/avg/ewma/max = 0.072/0.126/0.140/0.208 ms
[21:56:18] <robh>	 ok, we'll go swap the fiber and you can retest?
[21:56:21] <bblack>	 cp4024: 246/224 packets, 8% loss, min/avg/ewma/max = 0.071/0.132/0.142/0.212 ms
[21:56:35] <bblack>	 yes, I can re-test
[21:56:40] <robh>	 swapping fiber and optics is al i can think to try
[21:56:51] <bblack>	 the errors are likely unidirectional if switch doesn't see an issue
[22:01:38] <XioNoX>	 bblack: optic changed on the switch side, can you test?
[22:05:04] <robh>	 no loss when i ping from it
[22:05:10] <robh>	 but i didnt ping before so i didnt see the loss firsthand
[22:05:18] <bblack>	 yeah I'm gathering data now
[22:05:33] <robh>	 puppet works
[22:05:35] <robh>	 woooooo
[22:05:36] <robh>	 bad optic
[22:05:39] <robh>	 im throwing it away.
[22:05:56] <robh>	 im watching puppet apply config
[22:05:57] <bblack>	 ok
[22:06:03] <robh>	 and its good
[22:06:12] <robh>	 old optic trashed.
[22:06:22] <bblack>	 looks good from here, I don't observe bad error rates
[22:06:29] <robh>	 now that they are only 35 bucks a pop its less painful.  though the one i just threw away was 115
[22:06:51] <robh>	 so the puppet failure will clear shortly for icinga
[22:06:53] <robh>	 since its run
[22:07:20] <robh>	 I think that means we're done on-site?
[22:07:50] * robh hasnt seen icinga clear it yet but watched puppet run directly
[22:08:10] <robh>	 bblack: had notice in puppet run Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Notify[tlsproxy localssl default_server]/message: defined 'message' as 'tlsproxy::localssl instance unified with server name www.wikimedia.org is the default server.' 
[22:08:25] <robh>	 likey known but i wasnt sure so i mention it.
[22:10:54] <bblack>	 yeah it's normal
[22:10:56] <robh>	 all hosts are in hte clear
[22:10:59] <robh>	 so we're outta here =]
[22:11:02] <bblack>	 bye!
[22:11:05] <robh>	 back online from homes shortly =]
[22:45:06] <wikibugs>	 10Varnish: varnishkafka fails to build on Alpine Linux (strndupa) - https://phabricator.wikimedia.org/T186169#3935592 (10Jrdnch)
[22:55:41] <XioNoX>	 saw the re-pooling change, I'm online and keeping an eye on the new switches
[23:01:49] <bblack>	 I'm still poking at a few things, but close to repool merge now
[23:12:44] <bblack>	 going ok so far
[23:13:18] <bblack>	 the hitrates are a little off because the varnish frontends all crashed out and lost their contents during the mass confctl depooling, but it's not a major issue :)
[23:15:45] <wikibugs>	 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (10Papaul) p:05Triage>03Normal
[23:19:34] <robh>	 yeah i should have waited for a reply on the depool a few minutes longer ;P
[23:23:08] <bblack>	 anyways, traffic level seems to have stabilized at a reasonable volume.  hitrates are still coming up a bit.  I'm going to let them settle into numbers that are a bit higher, before rolling restarts to wipe the caches to get past the lost invalidations.
[23:23:21] <bblack>	 (seems less disruptive, so long as nobody's actively complaining about stale content)
[23:36:57] <XioNoX>	 Switch is behaving as it should
[23:37:11] <XioNoX>	 And we should not see the "Processor usage over 85%" alerts anymore
[23:38:39] <bblack>	 yay :)
[23:59:35] <wikibugs>	 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (10ayounsi) Interface description added, port up and in the private vlan. No MAC seen on the switch side so far.