[00:20:15] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933087 (10Krinkle)
[00:29:14] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933117 (10Krinkle)
[00:31:20] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933120 (10Krinkle)
[00:33:44] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933134 (10Krinkle)
[00:36:22] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#1810236 (10Krinkle)
[00:37:55] 10Traffic, 10Operations, 10TechCom-RFC, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3933165 (10Krinkle)
[00:51:15] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933246 (10mobrovac)
[02:36:08] 10Traffic, 10Operations, 10RESTBase, 10RESTBase-API, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3933362 (10Krinkle)
[03:12:20] 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3933395 (10Krinkle)
[03:13:09] 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3786217 (10Krinkle)
[06:22:26] hello people, just restarted the varnish backend on cp4024 since it was causing 503s
[06:30:39] oh, did you? Thank you. I was looking but wasn't sure if I should do just that. Next time I will. I saw it was limited to ulsfo upload, but not that cp4024 specifically was the root source.
[06:32:57] thanks, and recovery confirmed. out again
[15:58:42] bblack: how soon before we go onsite to replace the ulsfo switches do we need to depool the site?
[15:58:50] I'm guessing over an hour due to TTL
[15:58:52] ?
[16:08:28] robh: yeah, preferably. Remind me again the start time and rough guesstimate of the network downtime?
[16:09:20] I'm driving down at 9:30, but don't expect to get there until 10
[16:09:31] then XioNoX and I estimate 2-3 hours if things go well.
[16:09:44] or all day if things go terribly, but it's likely a minimum of 3 hours.
[16:09:50] https://gerrit.wikimedia.org/r/#/c/407022/
[16:14:31] robh: yeah, I'd push it now, and !log it too
[16:15:15] so I've done this before, but can someone else sanity-check +1 my patch?
[16:15:32] done!
[16:15:33] thx =]
[16:16:32] ok, pushed and ulsfo is now geodns depooled.
[16:16:48] and of course we'll check in before we start yanking shit out of the rack ;D
[16:17:05] XioNoX: Get that bike ready!
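(Side note on the TTL point above: once a geodns depool like this is live on the authoritative servers, it can be sanity-checked from the DNS side. This is a minimal sketch, not the documented procedure; querying ns0.wikimedia.org and using the 198.35.26.0/24 client-subnet prefix as a stand-in for "a client near ulsfo" are illustrative assumptions:)

    # Ask an authoritative server directly, pretending to be a client near ulsfo,
    # and confirm the answer no longer points at the ulsfo edge.
    dig +noall +answer +subnet=198.35.26.0/24 en.wikipedia.org @ns0.wikimedia.org

    # The TTL on the edge records (second column) bounds how long recursors keep
    # handing out the old answer, hence depooling well ahead of the maintenance window.
    dig +noall +answer text-lb.ulsfo.wikimedia.org @ns0.wikimedia.org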
;D
[16:17:26] I'll be faster than you in traffic :)
[16:21:26] lol
[16:35:49] by quite a bit, yeah
[16:36:05] pretty sure your biking there is largely unaffected by street traffic ;]
[17:43:58] 10Traffic, 10Operations, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#3934617 (10BBlack) TL;DR: The network itself doesn't seem to be at fault. Whatever this is, it probably affects esams more than othe...
[18:40:33] bblack: so I'm about to unplug the ulsfo systems
[18:40:44] should I shut them down or anything to help them recover, or just leave them online?
[18:40:50] other than manually depool, that is
[18:41:06] they're going to remain powered up and online in the OS unless you specify otherwise =]
[18:43:20] I assume I need to depool them one by one
[18:47:45] 10Traffic, 10netops, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3934835 (10RobH)
[18:47:53] Well, that's what I did, so if I was wrong let me know!
[18:48:13] since they are getting unplugged, it seemed safer to set to manual depool so no auto repooling happens...
[18:50:45] see -operations, I just caused a pager storm for a depooled site
[18:54:12] bblack: Ok, so now they are all back to manually pooled
[18:54:18] however, I'm about to start pulling their network connections
[18:54:33] so, before more pager storms ensue
[18:54:39] So what exactly should we maint in icinga?
[18:54:50] I've already set all the cp systems themselves and all services to maint.
[18:54:54] now that the pooling state is correct, just make sure icinga downtimes are set for: all ulsfo hosts, and all ulsfo LVS checks
[18:55:29] ok, I'll add the LVS checks into maint now
[18:56:23] robh: "since they are getting unplugged"? from power?
[18:57:06] hmm no, it says "remain powered up" a few lines above that; I guess you mean from the switch.
[18:57:39] either way, the confctl per-host depooling is for isolated cases. We never mass-depool a whole site's cp servers out of confctl. It doesn't work and causes unnecessary pain.
[18:58:08] just unplugged from the switch
[18:58:12] ok
[18:58:50] which I'm about to start doing now, if that's ok? I set maint mode for text and upload.ulsfo.wikimedia.org
[18:59:01] which I missed before =P
[18:59:51] * robh is standing by that it's ok
[19:01:38] the LVS checks look ok. You probably want to downtime the necessary network devices, too.
[19:02:09] asw-ulsfo at least, I imagine, but not the routers or oob? maybe also ripe-atlas
[19:03:09] true, will do now
[19:03:10] also, we could avoid ipsec spam by downtiming all of those for basically everything except the esams ipsec hosts
[19:03:25] you wanna downtime the ipsec stuff then?
[19:03:30] yeah, I'll poke at it
[19:04:17] it's a PITA, as there's no easy regex search for what to downtime for ipsec heh
[19:05:21] I'm downtiming the access switches, oob and atlas
[19:06:45] ipsec downtimes done
[19:06:46] ok done
[19:07:27] is mr1 affected in general? or just mr1.oob?
[19:08:29] either way, I think that's the only one I see where I don't know for sure
[19:08:44] I'll just set it all
[19:09:15] I set oob, and it's the other one technically, yeah
[19:10:21] ok
[19:11:14] I downtimed the ulsfo-specific 5xx rate checker too, although honestly if that goes off, probably one of the broader all-sites ones will go off pointlessly as well, which we can't disable. The checks just aren't ideal.
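(For reference, downtimes like the ones being set here can also be scripted against Icinga's external command file instead of clicking through the web UI. A rough sketch only: the command-file path, the host list, and the 6-hour window are assumptions, and Wikimedia's own tooling may differ:)

    # Run on the icinga host; the external command file path is an assumption.
    CMDFILE=/var/lib/icinga/rw/icinga.cmd
    now=$(date +%s); end=$((now + 6*3600))
    for host in cp4021 cp4022 cp4023 cp4024; do            # illustrative host list
      printf '[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;robh;ulsfo switch swap T185228\n' \
        "$now" "$host" "$now" "$end" "$((end - now))"
      printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%d;%d;1;0;%d;robh;ulsfo switch swap T185228\n' \
        "$now" "$host" "$now" "$end" "$((end - now))"
    done | sudo tee -a "$CMDFILE" > /dev/null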
[19:11:24] robh: lgtm
[19:12:08] robh: apparently not all ulsfo hosts were downtimed :)
[19:12:11] I'm monitoring this channel. If something goes wrong, let me know
[19:12:35] well shit, I forgot LVS
[19:12:38] sorry, still sick =P
[19:12:46] at least those hosts don't page
[19:13:15] and DNS. downtiming all those
[19:14:18] anyways, keep going
[19:15:24] TODO: maybe there should be some kind of virtual whole-datacenter object in icinga that can be downtimed to suppress everything in that DC, and all the other things depend on it (just to be used for scenarios like these)
[20:29:40] 10Traffic, 10Operations: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968#3929788 (10BBlack) In both cases the child was killed with signal 9 by the kernel oom-killer. It may be the case that our memory cache sizing is very tight in general, and that overheads have increased...
[20:33:33] fyi, on the ulsfo switch swap: the new switches are racked and all wired up, and arzhel is working on config stuff now
[20:38:31] config is all done
[20:38:46] working on dns monitoring now
[20:42:38] dns monitoring?
[20:42:49] oh, I see now, ignore that question!
[20:43:25] I'll deal with the invalidation/repool stuff once everything's up and going. You can leave it running-but-dns-depooled at that point.
[20:46:12] (my current thinking on the invalidation stuff is probably to repool the site with cache contents as-is, and then do some rolling cache wipes (via daemon restarts) over a reasonable timeframe to get past the missed-invalidation problems)
[21:06:21] 10Traffic, 10Operations, 10Wikimedia-Site-requests: oudated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3935268 (10bd808)
[21:19:16] removing scheduled downtimes in icinga for the hosts in ulsfo
[21:19:24] lvs, cp and mr1 removed
[21:19:32] not touching asw since it'll be a new asw hostname
[21:20:20] Can someone check if icinga is working fine? I renamed asw-ulsfo to asw2-ulsfo, and the first puppet run showed an issue when reloading icinga; the 2nd one was all fine
[21:21:13] Total Errors: 1
[21:21:23] Error: 'asw-ulsfo' is not a valid parent for host 'cp4021' (
[21:21:41] puppet will not restart it when the config check fails
[21:21:50] so it's not broken, but it would be on restart
[21:22:38] mutante: the run on einsteinium showed:
[21:22:38] - parents asw-ulsfo
[21:22:38] + parents asw2-ulsfo
[21:23:01] so it's doing the rename at some point
[21:23:45] 10Traffic, 10netops, 10Operations, 10ops-ulsfo: replace ulsfo access switches - https://phabricator.wikimedia.org/T185228#3935345 (10RobH)
[21:24:18] XioNoX: I'll try to fix it and see if puppet reverts
[21:24:38] mutante: that's the change I made: https://gerrit.wikimedia.org/r/#/c/407059/1/modules/netops/manifests/monitoring.pp
[21:25:33] I think maybe puppet has to run on all the cp hosts.. and then on einsteinium too
[21:25:37] checking
[21:25:59] it did it for some hosts, but not for cp4021.. running puppet on both now
[21:26:26] - parents asw2-ulsfo
[21:26:27] + parents asw-ulsfo
[21:26:34] it's actively reverting my change...
[21:28:42] \o/
[21:30:47] that wasn't a good thing so far :) but now it did the opposite, after I ran puppet on cp4021 and einsteinium again
[21:32:32] XioNoX: fixed now. Total Errors: 0
[21:32:45] and it didn't re-break it on the next run so far
[21:33:17] cool
[21:33:22] mutante: so what's the proper order?
[21:33:40] XioNoX: puppet run on all cp* hosts and then on einsteinium
[21:33:40] host server then einsteinium to clear, seems like?
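(A sketch of the order being worked out here, run from an admin workstation: refresh the LLDP-derived facts on the cache hosts whose parent changed, regenerate the icinga config on the monitoring host, then run the same config check quoted just below. The host list and the plain-ssh loop are illustrative assumptions:)

    # 1) Refresh facts/exported resources on the affected cache hosts.
    for host in cp4021 cp4022 cp4023 cp4024; do
      ssh "${host}.ulsfo.wmnet" 'sudo puppet agent -t'
    done

    # 2) Regenerate the icinga config on the monitoring host.
    ssh einsteinium.wikimedia.org 'sudo puppet agent -t'

    # 3) Verify the generated config parses cleanly ("Total Errors: 0").
    ssh einsteinium.wikimedia.org 'sudo icinga -v /etc/icinga/icinga.cfg'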
[21:33:43] cool
[21:33:48] ok, thanks
[21:33:56] and the check to get more info:
[21:33:56] [einsteinium:~] $ sudo icinga -v /etc/icinga/icinga.cfg
[21:34:03] anyone running on 4024 yet?
[21:34:31] I am.
[21:35:50] cp4024 is just sitting on loading facts =P
[21:36:13] that's also the one that needed to be kicked yesterday, heh
[21:36:15] it's remotely accessible so it's not an onsite issue, even though it's caused by onsite work...
[21:36:20] fuck typos
[21:36:55] mutante: kicked as in reboot?
[21:37:45] robh: as in "restart varnish backend"; it caused 5xx before, and then it was fixed after elukey did that
[21:37:45] no hardware failure events in the SEL
[21:37:56] oh, well it's still sitting on loading facts
[21:37:59] so something is fucked up.
[21:38:17] let's restart it and try again...
[21:38:24] restart the puppet run, that is.
[21:38:37] bleh, so far same issue.
[21:42:05] Not sure what's up with it.
[21:42:41] so
[21:43:05] scrolling back re: icinga, I'm pretty sure the magical dependencies on e.g. asw-ulsfo from various $hosts come from their lldpd-based $facts
[21:43:33] so they should fix themselves after a cycle of: all the ulsfo hosts running puppet agent, then running icinga again on the icinga master(s)
[21:43:39] So the only host that seems in a bad state is cp4024
[21:43:45] which is stuck on loading facts in puppet on each attempted run
[21:43:49] checking
[21:43:53] I'll cancel my run out
[21:44:00] so you can try it and see if you spot something I missed
[21:44:04] --verbose tells nothing ;]
[21:44:12] --debug
[21:44:16] ok, killed my run
[21:44:34] bblack: I assume you are checking it, or should I try with debug?
[21:44:39] yeah, I'm checking
[21:45:02] wow, puppet agent does some amusingly pointless and inefficient crap when observing startup via strace :)
[21:49:10] https://phabricator.wikimedia.org/T185228 is assigned to you now brandon, I think we're at the point for traffic handoff (other than cp4024)
[21:49:30] let me know if we need to stick around, otherwise I'd like to head out to beat rush hour =]
[21:49:38] so, nothing's really broken at the software level on cp4024, I don't think, per se. It's just taking a very long time to communicate with the puppetmaster...
[21:50:13] and I'm observing an error rate of ~10% on eth0
[21:50:37] (as in, on cp4024's eth0, RX errors/packets in RX packets:228926940811 errors:885462 dropped:827743 overruns:0 frame:885462
[21:50:40] )
[21:51:03] if you just look at the increase now. most of the RX packets are from before the downtime.
[21:51:13] so probably we have a bad connection to the new switch there
[21:52:31] I observe similar ballpark-10% loss rates with ping over time as well (cp4024->bast4002)
[21:52:40] 33/30 packets, 9% loss, min/avg/ewma/max = 0.071/0.143/0.133/0.210 ms
[21:52:44] so we need to swap the fiber?
[21:52:54] seems like the first step for bad packet loss....
[21:52:57] I have no idea what's actually wrong, I just know I'm observing network errors
[21:53:06] more likely a cracked fiber optic than anything port-specific to the new switch, imo
[21:54:08] what's sad is a lot of other stuff works fairly well with 10% loss, but puppet is so inefficient that 10% may as well be 100% :P
[21:55:09] arzhel is comparing network stack diagnostics between ports.
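(The ~10% figure above is the kind of thing that falls out of watching counter deltas and plain ping rather than lifetime totals; a minimal sketch, with the 60-second window, interface name, and comparison hosts as illustrative choices:)

    # Take two counter snapshots a minute apart and compare the RX "errors" column;
    # lifetime totals (like the packet counts pasted above) hide a newly-started fault.
    ip -s link show eth0; sleep 60; ip -s link show eth0

    # Ping the same target from the suspect host and from a healthy neighbour
    # (e.g. run on cp4024, then on cp4025) and compare the loss percentages.
    ping -q -c 100 bast4002.wikimedia.org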
[21:55:17] ideally we can see something there, so we go swap the cable and see
[21:55:25] (also, you wouldn't believe the number of times a puppet agent run ends up doing a for-loop over all the 16-bit integers and calling the close() syscall on them all :P)
[21:55:40] light levels are similar to other interfaces, and no errors on the switch port
[21:55:52] bblack: are you seeing that loss via the OS?
[21:56:03] yes
[21:56:07] cp4025: 138/138 packets, 0% loss, min/avg/ewma/max = 0.072/0.126/0.140/0.208 ms
[21:56:18] ok, we'll go swap the fiber and you can retest?
[21:56:21] cp4024: 246/224 packets, 8% loss, min/avg/ewma/max = 0.071/0.132/0.142/0.212 ms
[21:56:35] yes, I can re-test
[21:56:40] swapping the fiber and optics is all I can think to try
[21:56:51] the errors are likely unidirectional if the switch doesn't see an issue
[22:01:38] bblack: optic changed on the switch side, can you test?
[22:05:04] no loss when I ping from it
[22:05:10] but I didn't ping before, so I didn't see the loss firsthand
[22:05:18] yeah, I'm gathering data now
[22:05:33] puppet works
[22:05:35] woooooo
[22:05:36] bad optic
[22:05:39] I'm throwing it away.
[22:05:56] I'm watching puppet apply config
[22:05:57] ok
[22:06:03] and it's good
[22:06:12] old optic trashed.
[22:06:22] looks good from here, I don't observe bad error rates
[22:06:29] now that they are only 35 bucks a pop it's less painful. though the one I just threw away was 115
[22:06:51] so the puppet failure will clear shortly for icinga
[22:06:53] since it's run
[22:07:20] I think that means we're done on-site?
[22:07:50] * robh hasn't seen icinga clear it yet but watched puppet run directly
[22:08:10] bblack: had a notice in the puppet run: Notice: /Stage[main]/Profile::Cache::Ssl::Unified/Tlsproxy::Localssl[unified]/Notify[tlsproxy localssl default_server]/message: defined 'message' as 'tlsproxy::localssl instance unified with server name www.wikimedia.org is the default server.'
[22:08:25] likely known, but I wasn't sure so I mention it.
[22:10:54] yeah, it's normal
[22:10:56] all hosts are in the clear
[22:10:59] so we're outta here =]
[22:11:02] bye!
[22:11:05] back online from home shortly =]
[22:45:06] 10Varnish: varnishkafka fails to build on Alpine Linux (strndupa) - https://phabricator.wikimedia.org/T186169#3935592 (10Jrdnch)
[22:55:41] saw the re-pooling change, I'm online and keeping an eye on the new switches
[23:01:49] I'm still poking at a few things, but close to the repool merge now
[23:12:44] going ok so far
[23:13:18] the hitrates are a little off because the varnish frontends all crashed out and lost their contents during the mass confctl depooling, but it's not a major issue :)
[23:15:45] 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (10Papaul) p:05Triage>03Normal
[23:19:34] yeah, I should have waited for a reply on the depool a few minutes longer ;P
[23:23:08] anyways, traffic level seems to have stabilized at a reasonable volume. hitrates are still coming up a bit. I'm going to let them settle into numbers that are a bit higher, before rolling restarts to wipe the caches to get past the lost invalidations.
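(A rough sketch of what one round of those rolling restarts can look like per host; the confctl selector, service and unit names, and the drain time are assumptions rather than the actual runbook:)

    # One cache host at a time, letting hitrates recover before the next one.
    host=cp4021.ulsfo.wmnet
    sudo confctl select "name=${host},service=varnish-be" set/pooled=no    # stop sending it new traffic
    sleep 300                                                              # let in-flight requests drain
    ssh "$host" 'sudo systemctl restart varnish.service'                   # restart wipes the stale cache
    sudo confctl select "name=${host},service=varnish-be" set/pooled=yes   # repool, move to the next host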
[23:23:21] (seems less disruptive, so long as nobody's actively complaining about stale content)
[23:36:57] Switch is behaving as it should
[23:37:11] And we should not see the "Processor usage over 85%" alerts anymore
[23:38:39] yay :)
[23:59:35] 10netops, 10DBA, 10Operations, 10ops-codfw: switch port configuration for tendril2001 - https://phabricator.wikimedia.org/T186172#3935719 (10ayounsi) Interface description added, port up and in the private vlan. No MAC seen on the switch side so far.
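(For the kind of switch-port check described in that last task update, the Junos side can be confirmed roughly as below; the device name, interface, and vlan here are placeholders, and "No MAC seen" usually just means the server's NIC hasn't come up or sent traffic on that port yet:)

    # Port description, state and error counters for the server-facing interface.
    ssh asw-a-codfw 'show interfaces descriptions ge-1/0/10'
    ssh asw-a-codfw 'show interfaces ge-1/0/10 extensive | match error'

    # VLAN membership and whether the host's MAC has been learned on the port.
    ssh asw-a-codfw 'show ethernet-switching table interface ge-1/0/10'
    ssh asw-a-codfw 'show vlans private1-a-codfw'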