[09:15:35] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3045.esams.wmnet', 'cp4026.ulsfo.wmnet', 'cp5001.eqsin.wmnet'] ``` The log c...
[10:02:40] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4026.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4026.ulsfo.wmnet'] ```
[10:06:55] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4027.ulsfo.wmnet', 'cp3042.esams.wmnet', 'cp2007.codfw.wmnet'] ``` The log c...
[10:39:28] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4027.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4027.ulsfo.wmnet'] ```
[11:49:58] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2008.codfw.wmnet', 'cp2023.codfw.wmnet', 'cp3047.esams.wmnet'] ``` The log c...
[11:55:46] Traffic, Operations: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (ema)
[11:55:55] Traffic, Operations: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (ema) p:Triage>Normal
[11:57:10] Traffic, Operations: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (ema)
[12:21:08] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2008.codfw.wmnet', 'cp2023.codfw.wmnet', 'cp3047.esams.wmnet'] ``` and were **ALL** successful.
[12:56:40] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3036.esams.wmnet', 'cp4028.ulsfo.wmnet', 'cp2016.codfw.wmnet'] ``` The log c...
[13:27:58] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4028.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4028.ulsfo.wmnet'] ```
[13:44:30] volans: I don't see much in the logs other than:
[13:44:31] 2018-08-21 13:21:02 [INFO] (ema) wmf-auto-reimage::print_line: WARNING: failed to downtime host on Icinga, wmf-downtime-host returned 2
[13:48:21] ema: did you check the cumin log?
[13:49:51] ema: which host?
[13:51:05] I guess cp4028
[13:51:11] volans: cp4028, correct
[13:51:46] the cumin log is really hard to read alas
[13:51:46] ok so that downtime is run in a background subprocess
[13:52:04] but I couldn't see anything interesting there
[13:52:11] to overcome the races in the first puppet run hence why is not there
[13:52:21] ah!
[13:52:50] and now I'm wondering if that log gets lost
[13:53:01] or you just get it on stdout
[13:53:45] BTW if you rgrep 'WARNING: failed to downtime host on Icinga' on neodymium:/var/log/wmf-auto-reimage/ you'll find multiple occurrences of this
[13:54:10] I'm leaning to think it's the race we already talked about the other day/week
[13:54:32] multiple hosts in parallel, trying to downtime and being puppet so slow on the icinga host, the run-puppet-agent times out
[13:55:03] yeah I don't think I've ever seen this when upgrading a single host
[13:55:17] I need to change the behaviour so that when done in parallel it tries to do it only after the last one
[13:55:20] if they are taking the same time
[13:55:38] but I need to take into account also cases in which the 2 hosts are taking different times and I need to do it for both
[13:56:00] it's not a one-line fix and I don't have enough time to look into it right now
[13:56:11] I guess it will have to wait after the switch to codfw
[13:56:26] unless someone else wants to tackle it ;)
[13:56:45] * ema hides
[14:25:43] netops, Operations, ops-codfw: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (RobH)
[14:26:08] netops, Operations, ops-codfw: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (RobH)
[14:26:30] netops, Operations, ops-codfw: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (RobH) a:Papaul
[15:13:11] so.. I already got the first complete integration test... certcentral requesting the certificate, saving the http-01 challenge on disk, and a simple http.server answering the http-01 challenge from pebble
[15:22:08] yay
[15:34:06] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Mholloway) As discussed today, let's gradually make these shorter (e.g., 24h, 12h, 6h, 3h, 1h) and closely monitor the effect...
[15:55:03] hmmm funny
[15:55:24] once the ACME server marks the challenge as invalid, it cannot be resumed
[15:55:34] so you need to get another set of challenges
[15:56:21] so certcentral should verify that the challenges are being satisfied before asking the ACME server to solve the challenge
[15:56:42] yes
[16:11:54] ack, till that's implemented we will be a little bit more aggressive and we just restart the whole process O:)
[16:12:17] alright
[16:24:36] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (BBlack) Catching up a little here: what I'm seeing right now on tile images, from the public POV, is basically: `Cache-contr...
[16:30:04] netops, Operations, ops-eqiad: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (Cmjohnson) connections are complete. I was able to use the same cables with the exception of these two xe-3/0/0 xe-2/0/44 xe-1/1/0 4776 is now cable number 2172 xe-4/1/0 xe-7/0/45 xe-8...
[16:39:45] Traffic, Maps, Maps-Sprint, Operations, Reading-Infrastructure-Team-Backlog: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Gehel) >>! In T186732#4519685, @BBlack wrote: > Reducing the Varnish-level TTLs seems counter-productive for efficiency at al...
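(Editor's note on the 15:55 to 16:12 exchange above: a minimal sketch of the pre-check being proposed, where the client fetches its own http-01 challenge file and only then asks the ACME server to validate. This is not certcentral's actual code. The function names, the `fetch` parameter and the `respond` callback are hypothetical and exist only for illustration; the URL layout and the key authorization body follow the http-01 challenge as defined in RFC 8555.)

```python
import urllib.request
from typing import Callable


def http_fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return its body as text. Raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii", errors="replace")


def challenge_url(domain: str, token: str) -> str:
    """Well-known location where an http-01 challenge response must be served."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def challenge_is_served(
    domain: str,
    token: str,
    key_authorization: str,
    fetch: Callable[[str], str] = http_fetch,
) -> bool:
    """Return True only if the challenge URL already serves the expected body."""
    try:
        body = fetch(challenge_url(domain, token))
    except OSError:
        # Covers URLError, HTTPError, timeouts and refused connections.
        return False
    return body.strip() == key_authorization


def respond_if_ready(
    domain: str,
    token: str,
    key_authorization: str,
    respond: Callable[[], None],
    fetch: Callable[[str], str] = http_fetch,
) -> bool:
    """Call respond() (hypothetical: tells the ACME server to validate) only after
    the self-check passes, so a misconfigured responder does not burn the challenge."""
    if not challenge_is_served(domain, token, key_authorization, fetch):
        return False
    respond()
    return True
```

Passing `fetch` in as a parameter keeps the check testable without a network; in a real client the caller would presumably retry `respond_if_ready` a few times before giving up and requesting a new set of challenges.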
[16:50:52] netops, Operations, ops-eqiad: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (ayounsi) Open>Resolved a:ayounsi
[16:51:37] netops, Operations, ops-eqiad: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (ayounsi) Done, not a single ping was missed to a asw canary host (dns1001).
[16:56:31] Traffic, Operations, decommission, ops-eqiad, Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (Cmjohnson)
[16:56:39] Traffic, Operations, decommission, ops-eqiad, Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (Cmjohnson) Open>Resolved
[17:02:32] netops, Operations, ops-codfw: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (Papaul)
[17:02:45] netops, Operations, ops-codfw: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (Papaul) Open>Resolved
[17:18:41] netops, Operations, ops-eqdfw: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (ayounsi)
[17:44:39] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH) p:Triage>High
[17:44:53] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH)
[17:46:18] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH)
[17:51:25] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH)
[17:52:54] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (Reedy)
[18:07:41] Traffic, netops, Operations, ops-ulsfo, Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (RobH)
[18:07:44] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH)
[18:08:19] Traffic, Operations, ops-ulsfo: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (RobH)
[18:08:22] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH)
[18:08:57] Traffic, netops, Operations, ops-ulsfo: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (RobH) so bast4002 wasn't fully deployed as a bastion yet, it can go online in new site in advance of other systems (since its not in production) if that is useful.
[21:18:04] Traffic, Operations, monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (ayounsi) The main goal of that alert is to be notified if a site suddenly sees its traffic drop, from a network or other issue, but isn't 100% unreachable (...