[07:50:24] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964243 (Dzahn)
[10:44:27] bblack: codfw backends are reasonably warm now https://grafana.wikimedia.org/dashboard/db/prometheus-varnish-dc-stats?panelId=21&fullscreen&var-datasource=codfw%20prometheus%2Fops&var-cluster=cache_upload&var-layer=backend&from=1483635805857&to=1485254439147
[10:45:03] ok to repool? https://gerrit.wikimedia.org/r/#/c/333854/
[10:48:51] Traffic, Monitoring, Operations, Wikimedia-Incident: Plot number of cached objects on a per-server per-DC basis - https://phabricator.wikimedia.org/T154864#2964613 (ema) Open>Resolved @fgiunchedi added per-host stats as well: https://grafana.wikimedia.org/dashboard/db/varnish-machine-sta...
[11:00:28] we can wait for tomorrow ~10ish AM UTC which is when codfw is less busy of course
[12:21:45] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964809 (akosiaris) Done. Now esams+eqiad use install1001 as DHCP server and ulsfo+codfw use install2001 as DHCP server.
[12:27:51] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2964871 (akosiaris) @dzahn, I think that part is done, please do some tests and then we can resolve
[12:31:24] netops, DBA, Labs, Operations, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964878 (Marostegui)
[12:31:48] netops, DBA, Labs, Operations, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2961118 (Marostegui)
[12:51:16] netops, DBA, Labs, Operations, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2964931 (Marostegui)
[13:10:14] ema, bblack: I have upgraded cp1008 to the most recent 4.4 kernel image (based on 4.4.39), the kernel is also running on some other servers in production, so if you find the time it would be good to upgrade the LVS/CP hosts to that
[13:11:20] in a few weeks I'll build new 4.9 images, which can then slowly supersede our 4.4 jessie kernel
[13:15:13] moritzm: nice, thanks!
[13:15:53] moritzm: I'll start the upgrades later today or tomorrow
[13:16:56] nice, maybe start with a single cp* host initially and have a look for a day, the current deployments of wmf8 are mostly smaller systems recently installed/upgraded
[13:19:49] ok I'll start with one upload and one text machine, if they don't blow up we're good with misc and maps too :)
[13:33:31] netops, DBA, Labs, Operations, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965022 (Marostegui)
[14:10:31] netops, DBA, Labs, Operations, Patch-For-Review: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999#2965171 (Marostegui)
[14:36:22] ema: yes
[14:56:22] netops, Labs, Operations: asw-c2-eqiad reboots & fdb_mac_entry_mc_set() issues - https://phabricator.wikimedia.org/T155875#2965288 (Cmjohnson) added a secondary switch, asw2-c2-eqiad. accessible via scs port 48
[15:31:07] didn't know about https://github.com/quicwg/base-drafts !
[16:07:58] moritzm: oh and we should also upgrade to jessie 8.7 I guess
[16:10:29] we could s/8.6/8.7/ in T146011 and save the environment by not creating a new task!
[16:10:30] T146011: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011
[16:23:37] netops, Operations: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#2965652 (faidon) Just in: > Engineering has fixed PR 1238906 has been fixed through master PR 1205416, and the fix would be available 14.1X53-D42 onwards, sc...
[16:44:15] ema: half done for 8.7: https://phabricator.wikimedia.org/T155401
[16:44:49] but feel free to dist-upgrade the remaining one before reboot, the systemd update is still missing
[16:45:15] I'll deal with the remaining half cluster-wide this and maybe next week
[16:45:34] oooh cool!
I've phab searched for 8.7 and it didn't find a thing
[16:46:00] moritzm: can we close T146011 then?
[16:46:00] T146011: Integrate jessie 8.6 point release - https://phabricator.wikimedia.org/T146011
[16:47:10] oh, I never updated the task, I'll transfer what I have in my tracking emacs buffer later on, it's mostly done, but a few packages are not rolled out cluster-wide
[17:21:16] netops, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965816 (RobH) Resolved>Open I'm reopening this. LVS4002 had its power supply fail again, the exact same PSU slot that died before, PSU2. I had taken another power supply out of cp4012 an...
[17:29:10] netops, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965860 (RobH) So when we get the replacement power supplies mentioned on T156154, we should move the power ports used by lvs4002 with another system. Then if the other system has a psu failure, w...
[17:39:36] <_joe_> is there a way to make nginx have an unlimited proxy_read_timeout?
[17:41:40] I can check the source
[17:43:08] looks like possibly -1
[17:43:10] but I would test that
[17:43:17] that's the value for "unset"
[17:43:22] (for timers in nginx in general)
[17:44:21] <_joe_> ok thanks :)
[17:44:36] <_joe_> I hoped your knowledge would make me avoid having to search that
[17:44:51] <_joe_> basically I'm trying to use nginx as an auth proxy for etcd
[17:45:06] <_joe_> as etcd's own auth mechanism is horribly expensive
[17:45:10] <_joe_> in terms of performance
[17:45:31] if that doesn't work, just set it to weeks or something
[17:45:48] <_joe_> yup, will do
[17:46:51] <_joe_> bblack: did you have a chance to take a look at envoy, btw?
https://lyft.github.io/envoy/
[17:48:23] <_joe_> I think it's pretty interesting, I kinda asked Ryan if he or some colleague of his would like to have a chat with us about it, if there is some interest around
[17:48:58] <_joe_> I discovered it as google is planning to use it as a building block for an app "router" inside kubernetes
[17:53:09] <_joe_> (-1 doesn't work, btw, and some old thread suggests the max is 24 days, I just set it to 365d, let's see what happens)
[18:11:17] netops, Operations, ops-ulsfo: lvs4002 power supply failure - https://phabricator.wikimedia.org/T151273#2965943 (RobH) I asked Chris if we had any decommissioned R620s in eqiad so we can steal power supplies, but we do not. >>! In T156154#2965882, @Cmjohnson wrote: > We do not have any decommissione...
[19:11:16] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2966096 (Dzahn) @akosiaris Thank you! I have reinstalled planet2001 using install2001 and it worked fine. I will do some more tests for eqiad soon.
[23:46:17] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967330 (Dzahn) I also tested with prometheus1003 if the installer starts. It does.. (fails later at grub install but not related to this here).
[23:58:20] netops, Operations: netops: switch all subnets to use install1001/2001 as DHCP - https://phabricator.wikimedia.org/T156109#2967348 (Dzahn) Open>Resolved finally tested with analytics1015 (unused spare system), installed trusty image from install1001. services on carbon were down too. resolving now