[00:00:00] but it's in there [00:00:10] meh [00:00:13] so icinga has the other 'rows' as single switch stacks [00:00:21] but the two new rows use singular leafs on spines [00:00:28] i suspect this is an issue we didn't account for in icinga [00:00:34] aha [00:00:37] if i search for lsw in icinga for monitoring i see nothing [00:01:13] so, i'm not sure how these are going to be monitored in icinga... not sure if inetops knows [00:01:19] netops even [00:01:34] but... i don't wanna leave this broken, and don't wanna undo it or both moritz cannot help me troubleshoot raid drivers [00:01:43] and then without it in icinga as error hard for them to see it [00:01:46] =/ [00:01:50] my timezone sucks heh [00:01:57] yea.. thinking how to do it best [00:02:01] this was just a test install? [00:02:17] yeah, but i don't wanna decom and remove its network details if i can help it [00:02:37] i guess i may have to, or just maybe make a task to unbreak now icinga on the new row support? [00:02:41] would that go to netops i guess... [00:03:02] but leaving it unable to restart seems not good... i guess better to document it all and then rollback for the evening [00:03:12] more likely to observability or both [00:03:36] let me look though for 5 more min [00:04:37] SRE, DC-Ops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (RobH) [00:04:40] cool, i made a task but it's assigned to me for now [00:04:57] one option is to disable puppet, manually remove the host from icinga config [00:05:04] then it will alert about puppet disabled [00:05:09] but it won't break [00:05:19] that is if you don't want to decom the host [00:05:42] oh, can i remove from icinga via the https ui? [00:05:46] trying to see though if I find the actual switches [00:05:58] no, that would be via shell [00:06:18] oh yeah, ok, then maybe better to decom.. hrmm..
[00:06:47] objects/puppet_hosts.cfg: parents lsw1-f1-eqiad.mgmt.eqiad.wmnet [00:06:50] objects/puppet_hosts.cfg: parents lsw1-e3-eqiad.mgmt.eqiad.wmnet [00:12:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [00:15:20] (PS1) Cwhite: opensearch: prevent rundir from deletion [puppet] - https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) [00:21:06] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts dumpsdata1007.eqiad.wmnet [00:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:48] (PS1) Cwhite: aptrepo: update grafana version to <8.4 [puppet] - https://gerrit.wikimedia.org/r/767608 (https://phabricator.wikimedia.org/T282863) [00:24:58] (CR) Zabe: [C: +1] "(when wmf.24 is rolled out)" [mediawiki-config] - https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) (owner: Reedy) [00:25:03] !log robh@cumin1001 START - Cookbook sre.dns.netbox [00:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:41] icinga config is ok again after that decom [00:30:11] rescheduling [00:30:33] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [00:30:50] robh: ^ nothing to worry now [00:31:00] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:31:00] that ticket is still legit of course [00:31:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:08] SRE, DC-Ops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (RobH) So I had to decom the host overnight so I wouldn't leave icinga broken.
However, not sure how to add lsw1-f1-eqiad.mgmt.eqiad.wmnet so it works like lsw1-e3-eqiad.mgmt.eqiad.wmnet [00:31:19] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts dumpsdata1007.eqiad.wmnet [00:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:23] SRE, DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `dumpsdata1007.eqiad.wmnet` - dumpsdata1007.eqiad.wmnet (**WARN**) - //Host not found on Icinga, unable to downtime it// - F... [00:31:37] SRE, DC-Ops, Infrastructure-Foundations, netops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (RobH) [00:31:53] somehow the "lsw1-e3-eqiad.mgmt.eqiad.wmnet" is ok but the other new one is not [00:34:40] SRE, DC-Ops, Infrastructure-Foundations, netops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (RobH) Failed to run Homer on lsw1-f1-eqiad.mgmt.eqiad.wmnet: Command '['/usr/local/bin/homer', 'lsw1-f1-eqiad.mgmt.eqiad.wmnet', 'commit', 'Ho... [00:39:26] SRE, DC-Ops, Infrastructure-Foundations, netops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (Dzahn) When this host was installed and added to Icinga config by puppet, it broke Icinga config. The error was: ` Error: 'lsw1-f1-eqiad.mgmt.eqi...
[00:52:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [00:56:55] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 4.816e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [00:57:55] (ProbeHttpFailed) firing: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [01:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T0100). [01:09:31] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops [01:21:31] (PS1) Razzi: analytics_cluster::datahub::opensearch: add firewall and base_checks [puppet] - https://gerrit.wikimedia.org/r/767611 (https://phabricator.wikimedia.org/T301382) [01:22:49] (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34043/console" [puppet] - https://gerrit.wikimedia.org/r/767611 (https://phabricator.wikimedia.org/T301382) (owner: Razzi) [01:23:08] (CR) Btullis: [C: +1] "Looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/767611 (https://phabricator.wikimedia.org/T301382) (owner: Razzi) [01:25:19] (CR) Razzi: [V: +1 C: +2] analytics_cluster::datahub::opensearch: add firewall and base_checks [puppet] - https://gerrit.wikimedia.org/r/767611 (https://phabricator.wikimedia.org/T301382) (owner: Razzi) [01:37:55] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [01:42:44] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on datahubsearch[1001-1003].eqiad.wmnet with reason: Still having errors setting up opensearch [01:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:47] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on datahubsearch[1001-1003].eqiad.wmnet with reason: Still having errors setting up opensearch [01:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:55] (JobUnavailable) firing: (2) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [02:17:08] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 1.603e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [02:52:41] (CR) SBassett: [C: +1] Use namespaced ApiFeatureUsageQueryEngineElastica [mediawiki-config] - https://gerrit.wikimedia.org/r/767596 (https://phabricator.wikimedia.org/T302907) (owner: Reedy) [02:54:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00
on db1101.eqiad.wmnet with reason: Maintenance [02:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [02:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21731 and previous config saved to /var/cache/conftool/dbconfig/20220303-025500-ladsgroup.json [02:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:55:03] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [03:05:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [03:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [03:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T302950)', diff saved to https://phabricator.wikimedia.org/P21732 and previous config saved to /var/cache/conftool/dbconfig/20220303-030518-ladsgroup.json [03:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:21] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [03:06:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21733 and previous config saved to /var/cache/conftool/dbconfig/20220303-030618-ladsgroup.json [03:06:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:21] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [03:07:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2147.codfw.wmnet with OS bullseye [03:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21734 and previous config saved to /var/cache/conftool/dbconfig/20220303-032123-ladsgroup.json [03:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: host reimage [03:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2147.codfw.wmnet with reason: host reimage [03:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:34:34] ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (Ladsgroup) [03:36:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P21735 and previous config saved to /var/cache/conftool/dbconfig/20220303-033628-ladsgroup.json [03:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:35] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:40:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2147.codfw.wmnet with OS bullseye [03:40:15] Logged the message
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:55] (ProbeHttpFailed) resolved: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [03:46:43] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:49:15] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 8 probes of 740 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [03:51:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21736 and previous config saved to /var/cache/conftool/dbconfig/20220303-035134-ladsgroup.json [03:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:38] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [03:51:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [03:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [03:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [03:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:51:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
12:00:00 on 10 hosts with reason: Maintenance [03:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:52:35] SRE, ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (Ladsgroup) Reimage is done and I started replication so it doesn't stay behind for that long but I haven't repooled it, let me know when you want to swap the disk so I shut down mysql I'm new to all of this. Can it... [03:53:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [03:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Maintenance [03:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2140 (T302950)', diff saved to https://phabricator.wikimedia.org/P21737 and previous config saved to /var/cache/conftool/dbconfig/20220303-035328-ladsgroup.json [03:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:53:31] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [03:56:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2140.codfw.wmnet with OS bullseye [03:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:46] !log
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T300992)', diff saved to https://phabricator.wikimedia.org/P21738 and previous config saved to /var/cache/conftool/dbconfig/20220303-035954-ladsgroup.json [03:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:57] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [04:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300992)', diff saved to https://phabricator.wikimedia.org/P21739 and previous config saved to /var/cache/conftool/dbconfig/20220303-040412-ladsgroup.json [04:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2140.codfw.wmnet with reason: host reimage [04:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2140.codfw.wmnet with reason: host reimage [04:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:19:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21740 and previous config saved to /var/cache/conftool/dbconfig/20220303-041916-ladsgroup.json 
[04:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2140.codfw.wmnet with OS bullseye [04:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P21741 and previous config saved to /var/cache/conftool/dbconfig/20220303-043421-ladsgroup.json [04:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2140 (T302950)', diff saved to https://phabricator.wikimedia.org/P21742 and previous config saved to /var/cache/conftool/dbconfig/20220303-043759-ladsgroup.json [04:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:02] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [04:39:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [04:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2136.codfw.wmnet with reason: Maintenance [04:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:39:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T302950)', diff saved to https://phabricator.wikimedia.org/P21743 and previous config saved to /var/cache/conftool/dbconfig/20220303-043942-ladsgroup.json [04:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:40:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2136.codfw.wmnet with OS bullseye [04:40:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T300992)', diff saved to https://phabricator.wikimedia.org/P21744 and previous config saved to /var/cache/conftool/dbconfig/20220303-044926-ladsgroup.json [04:49:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [04:49:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T300992)', diff saved to https://phabricator.wikimedia.org/P21745 and previous config saved to /var/cache/conftool/dbconfig/20220303-044933-ladsgroup.json [04:50:29] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [04:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2136.codfw.wmnet with reason: host reimage [04:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2136.codfw.wmnet with reason: host reimage [04:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300992)', diff saved to https://phabricator.wikimedia.org/P21746 and previous config saved to /var/cache/conftool/dbconfig/20220303-051444-ladsgroup.json 
[05:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:49] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [05:14:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2136.codfw.wmnet with OS bullseye [05:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21747 and previous config saved to /var/cache/conftool/dbconfig/20220303-052949-ladsgroup.json [05:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T302950)', diff saved to https://phabricator.wikimedia.org/P21748 and previous config saved to /var/cache/conftool/dbconfig/20220303-053324-ladsgroup.json [05:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:29] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [05:37:22] hello friends! upstream connect error or disconnect/reset before headers. reset reason: overflow [05:37:37] yep, seeing that too [05:37:39] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:37:41] your canary as ever, Tamzin [05:38:04] darn I actually lost an edit on that one :-/ at least nothing big [05:39:19] SRE, User-Ladsgroup, Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505 (MZMcBride) >>! In T301505#7742137, @akosiaris wrote: > Hi! This resurfaced during the weekend. It is not a single issue (despite appe...
[05:39:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [05:41:27] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [05:41:39] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:39] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:39] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:39] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp5015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:43] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:43] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:45] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:49] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:49] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp5016 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:53] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:41:53] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:42:05] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:42:13] here, looking [05:42:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:42:55] (ProbeHttpFailed) firing: URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [05:42:59] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp5010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:43:05] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp5008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [05:44:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P21749 and previous config saved to /var/cache/conftool/dbconfig/20220303-054454-ladsgroup.json [05:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org 
[05:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:49] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:49] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:49] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:49] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp5015 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:53] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:53] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.463 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:55] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 472 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:45:59] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [05:46:00] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp5016 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.460 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:46:00] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp5009 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:46:05] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 474 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:46:05] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.469 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:46:17] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5007 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:46:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [05:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: Maintenance [05:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T302950)', diff saved to https://phabricator.wikimedia.org/P21750 and previous config saved to /var/cache/conftool/dbconfig/20220303-054657-ladsgroup.json [05:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:00] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [05:47:25] RECOVERY - 
Varnish HTTP text-frontend - port 3122 on cp5010 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:47:31] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp5008 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.456 second response time https://wikitech.wikimedia.org/wiki/Varnish [05:47:55] (JobUnavailable) firing: (3) Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [05:47:55] (ProbeHttpFailed) resolved: (2) URL did not return HTTP 2xx or 3xx response (or probe/connection failed) - https://wikitech.wikimedia.org/wiki/Prometheus#Watchrat_Non-23xx_HTTP_response - https://grafana.wikimedia.org/d/GYciEga7z/watchrat - https://alerts.wikimedia.org [05:52:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2119.codfw.wmnet with OS bullseye [05:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:29] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T300992)', diff saved to https://phabricator.wikimedia.org/P21751 and previous config saved to /var/cache/conftool/dbconfig/20220303-055959-ladsgroup.json [06:00:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:00:04] T300992: Add linter_template and linter_tag 
columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T300992)', diff saved to https://phabricator.wikimedia.org/P21752 and previous config saved to /var/cache/conftool/dbconfig/20220303-060006-ladsgroup.json [06:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300992)', diff saved to https://phabricator.wikimedia.org/P21753 and previous config saved to /var/cache/conftool/dbconfig/20220303-060423-ladsgroup.json [06:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2119.codfw.wmnet with reason: host reimage [06:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2119.codfw.wmnet with reason: host reimage [06:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21754 and previous config saved to /var/cache/conftool/dbconfig/20220303-061928-ladsgroup.json [06:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2119.codfw.wmnet with OS bullseye [06:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 
(T302950)', diff saved to https://phabricator.wikimedia.org/P21755 and previous config saved to /var/cache/conftool/dbconfig/20220303-063350-ladsgroup.json [06:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:54] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [06:34:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P21756 and previous config saved to /var/cache/conftool/dbconfig/20220303-063433-ladsgroup.json [06:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [06:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance [06:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T302950)', diff saved to https://phabricator.wikimedia.org/P21757 and previous config saved to /var/cache/conftool/dbconfig/20220303-063514-ladsgroup.json [06:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2106.codfw.wmnet with OS bullseye [06:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T300992)', diff saved to https://phabricator.wikimedia.org/P21758 and previous config saved to /var/cache/conftool/dbconfig/20220303-064937-ladsgroup.json [06:49:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db1181.eqiad.wmnet with reason: Maintenance [06:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [06:49:43] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [06:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T300992)', diff saved to https://phabricator.wikimedia.org/P21759 and previous config saved to /var/cache/conftool/dbconfig/20220303-064945-ladsgroup.json [06:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2106.codfw.wmnet with reason: host reimage [06:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:33] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 48.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:52:35] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 28.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300992)', diff saved to https://phabricator.wikimedia.org/P21760 and previous config saved to /var/cache/conftool/dbconfig/20220303-065405-ladsgroup.json [06:54:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2106.codfw.wmnet with reason: host reimage [06:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 73.86 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [06:56:19] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:02:20] (03PS1) 10Ayounsi: Revert "Depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/767105 [07:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21761 and previous config saved to /var/cache/conftool/dbconfig/20220303-070910-ladsgroup.json [07:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2106.codfw.wmnet with OS bullseye [07:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T302950)', diff saved to https://phabricator.wikimedia.org/P21762 and previous config saved to /var/cache/conftool/dbconfig/20220303-071800-ladsgroup.json [07:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:04] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [07:24:16] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P21763 and previous config saved to /var/cache/conftool/dbconfig/20220303-072415-ladsgroup.json [07:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:04] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4034.ulsfo.wmnet with OS buster [07:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:17] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster [07:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T300992)', diff saved to https://phabricator.wikimedia.org/P21764 and previous config saved to /var/cache/conftool/dbconfig/20220303-073920-ladsgroup.json [07:39:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [07:39:24] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2090.codfw.wmnet with reason: Maintenance [07:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:05] !log ladsgroup@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2090.codfw.wmnet with reason: Maintenance [07:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2090 (T302950)', diff saved to https://phabricator.wikimedia.org/P21765 and previous config saved to /var/cache/conftool/dbconfig/20220303-074209-ladsgroup.json [07:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:12] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [07:45:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2090.codfw.wmnet with OS bullseye [07:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:11] (03PS1) 104nn1l2: fawiki: Remove the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767683 (https://phabricator.wikimedia.org/T302957) [07:47:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [07:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4034.ulsfo.wmnet with reason: host reimage [07:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [07:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: 
Maintenance [07:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21766 and previous config saved to /var/cache/conftool/dbconfig/20220303-075534-ladsgroup.json [07:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:37] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [07:57:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4034.ulsfo.wmnet with reason: host reimage [07:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2090.codfw.wmnet with reason: host reimage [07:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:05] Amir1 and apergos: How many deployers does it take to do UTC morning backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T0800). [08:00:08] morning! there are no trainees for today and no patches scheduled for deployment in the window [08:00:27] so the answer to that riddle would be: zero! [08:00:39] hi [08:00:50] hello. [08:01:14] o/ [08:01:16] jouncebot did not ping me! [08:01:26] jouncebot: now [08:01:26] For the next 0 hour(s) and 58 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T0800) [08:01:34] I don't see any patches listed for you on this window [08:01:37] nope [08:01:44] no patches, no pings! [08:02:36] give me some time, please. 
I have some connection problems [08:03:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2090.codfw.wmnet with reason: host reimage [08:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:07] should be fixed now [08:05:08] I filed mine in the wrong day [08:05:21] I had some edit conflict with zabe https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1953750&oldid=1953749 [08:05:34] :-D [08:05:45] well zabe did you want to move yours? [08:05:47] and I probably messed up [08:05:50] apergos: you want to deploy or should I? [08:06:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21767 and previous config saved to /var/cache/conftool/dbconfig/20220303-080603-ladsgroup.json [08:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:07] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:06:33] taavi: mmmm [08:06:43] I'm still waking up kind of [08:07:12] if you don't mind doing the deed [08:07:16] sure [08:08:40] ok, zabe moved the patch so we now have two patches, no trainees for today [08:08:42] (03CR) 10Majavah: [C: 03+2] fawiki: Remove the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767683 (https://phabricator.wikimedia.org/T302957) (owner: 104nn1l2) [08:08:55] apergos, yep, sorry missed your message [08:09:09] no worries [08:09:26] (03Merged) 10jenkins-bot: fawiki: Remove the Book namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767683 (https://phabricator.wikimedia.org/T302957) (owner: 104nn1l2) [08:09:29] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for puppetdb microservice [puppet] - 10https://gerrit.wikimedia.org/r/767174 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:10:38] nn1l2: please test on mwdebug1001 [08:10:44] nn1l2: are there still articles left in the Book namespace (redirects even)? if so, those will need to be cleaned up [08:11:01] no [08:11:17] no articles, all have been moved without leaving a redirect [08:11:26] perfect, and I assume talk pages too [08:11:59] LGTM [08:12:11] great [08:12:23] ok, syncing [08:13:18] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:767683|fawiki: Remove the Book namespace (T302957)]] (duration: 00m 51s) [08:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:23] T302957: Remove the Book namespace from Farsi Wikipedia - https://phabricator.wikimedia.org/T302957 [08:14:15] (03PS4) 10Majavah: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [08:14:26] (03CR) 10Majavah: [C: 03+2] Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [08:15:10] (03Merged) 10jenkins-bot: Add centralauth-suppress to steward and wmf-supportsafety at metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766306 (https://phabricator.wikimedia.org/T302675) (owner: 10Zabe) [08:15:39] zabe: your patch is available for testing on mwdebug1001 [08:16:45] Thanks! 
[08:16:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2090.codfw.wmnet with OS bullseye [08:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:04] taavi, lgtm [08:17:24] nn1l2: you're welcome [08:17:25] zabe: syncing [08:18:26] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:766306|Add centralauth-suppress to steward and wmf-supportsafety at metawiki (T302675)]] (duration: 00m 50s) [08:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:29] T302675: Rename centralauth-oversight to centralauth-suppress following the rename of oversight to suppress - https://phabricator.wikimedia.org/T302675 [08:18:44] that's live too [08:19:06] thx [08:19:21] do you want me to update the global groups now too? [08:19:23] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:19:49] hmm, I hope that restbase alert is not related to these config changes [08:19:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4034.ulsfo.wmnet with OS buster [08:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:04] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4034.ulsfo.wmnet with OS buster c... 
[08:21:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21768 and previous config saved to /var/cache/conftool/dbconfig/20220303-082108-ladsgroup.json [08:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:56] the link in the alert goes nowhere, sadly [08:22:32] it's only on one host though. [08:24:26] zabe: ^ see above re sysadmin group [08:26:02] taavi, yes, you can do it [08:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2090 (T302950)', diff saved to https://phabricator.wikimedia.org/P21770 and previous config saved to /var/cache/conftool/dbconfig/20220303-082656-ladsgroup.json [08:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:27] hi, is it too late to add a config patch to this window? [08:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2073.codfw.wmnet with reason: Maintenance [08:28:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2073.codfw.wmnet with reason: Maintenance [08:28:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [08:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [08:28:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2073 (T302950)', diff saved to https://phabricator.wikimedia.org/P21771 and previous config saved to /var/cache/conftool/dbconfig/20220303-082842-ladsgroup.json [08:28:49] kostajh: probably not [08:28:54] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [08:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:11] ok, I'll update the calendar [08:30:56] I guess hnowlan will know better about the restbase2025 host, there has been a migration and some decommissioning going on [08:31:17] I can't easily find the exact status in the phab task [08:31:35] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: remove realm check, move listen_addresses to hiera [puppet] - 10https://gerrit.wikimedia.org/r/767484 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [08:31:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db2073.codfw.wmnet with OS bullseye [08:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:42] although, I just noticed that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/766869 doesn't set a 'default' key, do we have a policy about requiring that? [08:32:03] if so, I can amend the patch to set a default [08:32:12] good question [08:32:31] I don't know if there's a policy but it sounds like a good idea regardless [08:32:50] (03CR) 10Awight: [C: 03+1] "Patch looks right, but the commit summary might need adjustment."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: 10WMDE-Fisch) [08:33:06] alright, let me update it, and add to the calendar [08:33:16] sounds good [08:33:24] (03PS3) 10Kosta Harlan: GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [08:33:28] (03CR) 10Awight: [C: 03+1] VE template back and delete button on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [08:34:12] hey taavi, I see you here regularly for this window, have you thought about adding your name to the list of deployers? currently (since the time of the window changed) it's only me and Amir listed, a bit lonely, especially since his sleep schedule shifts around a lot :-D [08:34:44] !log installing expat security updates [08:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:09] (03CR) 10Awight: [C: 03+1] VE template back and delete button on all wikis except enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [08:35:29] (03CR) 10Awight: [C: 03+1] Template search improvements to all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: 10WMDE-Fisch) [08:36:09] (03CR) 10Awight: [C: 03+1] Bracket matching on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: 10WMDE-Fisch) [08:36:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P21772 and previous config saved to /var/cache/conftool/dbconfig/20220303-083613-ladsgroup.json
[08:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:25] (03PS4) 10Kosta Harlan: GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [08:36:37] (03CR) 10Awight: [C: 03+1] Syntax highlighting color scheme update on all wikis except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: 10WMDE-Fisch) [08:36:45] ready; apergos & taavi would you mind having a look at it? [08:37:03] it's going to be a no-op, it's putting configuration in place that will be used once wmf.24 is in group2 [08:37:41] hmm so we won't know if there is some weird behaviour until later [08:37:58] will you be able to be around for that (or someone else who is familiar with it)? [08:38:27] yeah, m.ewoph will be around later [08:38:31] aweseom [08:38:32] e [08:38:54] the patch looks ok to me, I'll defer to the person actually doing today's deploys however [08:39:13] lgtm too, happy to deploy it [08:39:20] taavi: thanks [08:39:41] taavi: I can verify that user registration with and without the campaign parameter doesn't break things, at least. please ping me when it's on mwdebug [08:39:59] apergos: I can't commit to actually being able to train people at this time slot, but if 'usually around and able to deploy' is enough feel free to add myself [08:40:01] sure [08:40:09] yes, you don't have to be a trainer [08:40:14] (03CR) 10Majavah: [C: 03+2] GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [08:40:17] I'm happy to do the main part of that load [08:40:47] thcipriani: can you add taavi to the list of deployers for this slot? ^^ thanks in advance! 
[08:40:49] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:40:57] (03Merged) 10jenkins-bot: GLAM event: Update wgGECampaigns and wgGECampaignTopics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766869 (https://phabricator.wikimedia.org/T301029) (owner: 10MewOphaswongse) [08:41:58] kostajh: pulled to mwdebug1001 [08:42:06] taavi: thanks, having a look [08:45:18] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34046/console" [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [08:45:20] (03CR) 10Filippo Giunchedi: [C: 03+1] aptrepo: update grafana version to <8.4 [puppet] - 10https://gerrit.wikimedia.org/r/767608 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [08:46:30] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] opensearch: prevent rundir from deletion [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [08:46:44] taavi: nothing broke, as far as I could tell :) [08:47:02] cool, syncing [08:47:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2073.codfw.wmnet with reason: host reimage [08:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:15] (03PS1) 10Muehlenhoff: Add Cumin alias for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/767709 [08:48:07] (03PS2) 10Muehlenhoff: Add Cumin alias for datahubsearch [puppet] - 10https://gerrit.wikimedia.org/r/767709 [08:48:32] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:766869|GLAM event: Update wgGECampaigns and wgGECampaignTopics (T301029)]] (duration: 00m 51s) [08:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:35] T301029: 
Account creation: GLAM event topic availability - https://phabricator.wikimedia.org/T301029 [08:48:40] kostajh: deployed! [08:49:54] taavi: thank you [08:49:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2073.codfw.wmnet with reason: host reimage [08:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21773 and previous config saved to /var/cache/conftool/dbconfig/20220303-085118-ladsgroup.json [08:51:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [08:51:22] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [08:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21774 and previous config saved to /var/cache/conftool/dbconfig/20220303-085125-ladsgroup.json [08:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:14] that might be the end of the window then? [08:52:38] !log UTC morning deploys done [08:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:03] thanks for doing all the work :-) and agreeing to be added by name to the window! 
see everyone next time
[08:54:08] (PS1) Jcrespo: Prepare for release 0.6 [software/wmfbackups] - https://gerrit.wikimedia.org/r/767710
[08:54:54] (PS1) Muehlenhoff: Fix Cumin alias for an-tool* [puppet] - https://gerrit.wikimedia.org/r/767711
[08:54:56] (CR) Jcrespo: [C: +2] Prepare for release 0.6 [software/wmfbackups] - https://gerrit.wikimedia.org/r/767710 (owner: Jcrespo)
[08:56:25] (PS1) Kosta Harlan: GLAM event: Update landing page content [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097)
[08:57:56] (PS3) Ladsgroup: auto_schema: Add support for --check in running schema changes [software] - https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896)
[08:58:33] (PS2) Muehlenhoff: Fix Cumin alias for an-tool* [puppet] - https://gerrit.wikimedia.org/r/767711
[09:00:38] (PS4) Ladsgroup: auto_schema: Add support for --check in running schema changes [software] - https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896)
[09:00:51] (PS4) Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208)
[09:01:19] !log restarting superset on an-tool1010 to pick up expat security updates
[09:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:56] (CR) Elukey: [C: -1] Add kubernetes20[19-22] to wikikube codfw (1 comment) [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[09:02:24] (PS1) Jcrespo: Use yaml safeloader to parse config files [software/wmfbackups] - https://gerrit.wikimedia.org/r/767716
[09:03:35] (PS5) Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208)
[09:04:31] (CR) Ladsgroup: "Tested in
cumin and works as expected." [software] - https://gerrit.wikimedia.org/r/767554 (https://phabricator.wikimedia.org/T301896) (owner: Ladsgroup)
[09:04:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2073.codfw.wmnet with OS bullseye
[09:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:26] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 6 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34047/console" [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[09:07:37] (PS6) Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208)
[09:07:44] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for Apache on an-web [puppet] - https://gerrit.wikimedia.org/r/767717 (https://phabricator.wikimedia.org/T135991)
[09:08:58] (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34048/console" [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[09:09:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/767717 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[09:12:25] !log restarting FPM/Apache on mw API servers to pick up expat security updates
[09:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2073 (T302950)', diff saved to https://phabricator.wikimedia.org/P21775 and previous config saved to /var/cache/conftool/dbconfig/20220303-091340-ladsgroup.json
[09:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:43] T302950: Upgrade s4 to bullseye -
https://phabricator.wikimedia.org/T302950
[09:13:55] (CR) Thiemo Kreuz (WMDE): [C: +1] Bracket matching on all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767499 (https://phabricator.wikimedia.org/T280023) (owner: WMDE-Fisch)
[09:14:11] (CR) Thiemo Kreuz (WMDE): [C: +1] Template search improvements to all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767510 (https://phabricator.wikimedia.org/T286990) (owner: WMDE-Fisch)
[09:14:31] (CR) Thiemo Kreuz (WMDE): [C: +1] VE template back and delete button on all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767508 (https://phabricator.wikimedia.org/T286990) (owner: WMDE-Fisch)
[09:14:55] (CR) Thiemo Kreuz (WMDE): [C: +1] Syntax highlighting color scheme update on all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767498 (https://phabricator.wikimedia.org/T280024) (owner: WMDE-Fisch)
[09:16:03] (CR) Thiemo Kreuz (WMDE): [C: +1] VE template extended sidebar and inline descriptions on all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: WMDE-Fisch)
[09:16:35] (CR) Elukey: [V: +1 C: +2] Add kubernetes20[19-22] to wikikube codfw [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[09:16:41] (PS7) Elukey: Add kubernetes20[19-22] to wikikube codfw [puppet] - https://gerrit.wikimedia.org/r/767482 (https://phabricator.wikimedia.org/T302208)
[09:17:51] (CR) jerkins-bot: [V: -1] GLAM event: Update landing page content [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[09:18:31] (PS1) Ladsgroup: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767719
(https://phabricator.wikimedia.org/T302950)
[09:19:26] (CR) Ladsgroup: [C: +2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767719 (https://phabricator.wikimedia.org/T302950) (owner: Ladsgroup)
[09:19:55] (CR) Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[09:27:50] (PS1) Filippo Giunchedi: Change Grafana dashboard links to new format [puppet] - https://gerrit.wikimedia.org/r/767720 (https://phabricator.wikimedia.org/T302958)
[09:33:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
[09:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T302950)', diff saved to https://phabricator.wikimedia.org/P21777 and previous config saved to /var/cache/conftool/dbconfig/20220303-093306-ladsgroup.json
[09:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:09] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950
[09:36:04] SRE, Infrastructure-Foundations, netops: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi) Slightly related, as of today those devices don't support ssh-ed25519: (11) asw2-b-eqiad.mgmt.eqiad.wmnet,asw2-c-eqiad.mgmt.eqiad.wmnet,asw2-d-eqi...
[09:36:18] (PS3) WMDE-Fisch: VE template expanded sidebar and inline descriptions on all wikis except enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991)
[09:36:46] (PS1) Ladsgroup: rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767691 (https://phabricator.wikimedia.org/T255493)
[09:36:56] (CR) WMDE-Fisch: "ded" [mediawiki-config] - https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: WMDE-Fisch)
[09:37:02] (PS1) Ladsgroup: rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/767692 (https://phabricator.wikimedia.org/T255493)
[09:37:27] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@1c8384f]: AF //tion default args
[09:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1160.eqiad.wmnet with OS bullseye
[09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:36] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@1c8384f]: AF //tion default args (duration: 00m 09s)
[09:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:03] (CR) WMDE-Fisch: VE template expanded sidebar and inline descriptions on all wikis except enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/767512 (https://phabricator.wikimedia.org/T286991) (owner: WMDE-Fisch)
[09:42:06] (CR) Elukey: [C: +2] Add BGP config for kubernetes20[19-22] in wikikube codfw [homer/public] - https://gerrit.wikimedia.org/r/767485 (https://phabricator.wikimedia.org/T302208) (owner: Elukey)
[09:43:06] Please go to https://wikitech.wikimedia.org/wiki/Deployments
[09:43:25] Then click on [curr] on the
side bar [09:43:47] It will take you to 20220208T0800 [09:43:58] while today is 3 March [09:44:39] That's why both me and zabe scheduled our patches in the wrong window during the latest B&C [09:47:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [09:48:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: host reimage [09:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:05] Hi taavi, you probably know what I am talking about as you were the deployer at that B&C [09:51:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21778 and previous config saved to /var/cache/conftool/dbconfig/20220303-095145-ladsgroup.json [09:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:50] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [09:53:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: host reimage [09:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:01] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: 10Kosta Harlan) [09:56:02] (03PS1) 10Filippo Giunchedi: Update Grafana dashboard links to new format [alerts] - 10https://gerrit.wikimedia.org/r/767726 [10:00:36] (03PS2) 10Filippo Giunchedi: Change Grafana dashboard links to new format [alerts] - 10https://gerrit.wikimedia.org/r/767726 (https://phabricator.wikimedia.org/T302958) 
[10:01:24] I'm seeking reviewers for https://gerrit.wikimedia.org/r/c/operations/puppet/+/767720 and https://gerrit.wikimedia.org/r/c/operations/alerts/+/767726 (both straightforward)
[10:04:05] wow long ones :)
[10:04:10] * elukey reviewing
[10:04:51] yeah it is all mechanical changes really
[10:05:38] going to check the links and +1 in say 10
[10:06:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P21779 and previous config saved to /var/cache/conftool/dbconfig/20220303-100649-ladsgroup.json
[10:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:01] thanks, appreciate it
[10:09:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1160.eqiad.wmnet with OS bullseye
[10:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:35] (CR) Elukey: [C: +1] "Checked all the new dashboard links and everything looks good!"
[puppet] - https://gerrit.wikimedia.org/r/767720 (https://phabricator.wikimedia.org/T302958) (owner: Filippo Giunchedi)
[10:14:56] (PS1) Filippo Giunchedi: profile: issue warnings for check_mw_versions [puppet] - https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832)
[10:15:40] (PS1) Alexandros Kosiaris: rdb1012: Switch to use rdb1011 as master [puppet] - https://gerrit.wikimedia.org/r/767730 (https://phabricator.wikimedia.org/T281217)
[10:15:42] (PS1) Alexandros Kosiaris: rdb1006: Switch to decom [puppet] - https://gerrit.wikimedia.org/r/767731 (https://phabricator.wikimedia.org/T281217)
[10:15:44] (PS1) Alexandros Kosiaris: rdb1005: Switch all usages to rdb1011 [puppet] - https://gerrit.wikimedia.org/r/767732 (https://phabricator.wikimedia.org/T281217)
[10:15:46] (PS1) Alexandros Kosiaris: rdb1011: Switch to master [puppet] - https://gerrit.wikimedia.org/r/767733 (https://phabricator.wikimedia.org/T281217)
[10:15:48] (PS1) Alexandros Kosiaris: rdb1005: Move to system::spare [puppet] - https://gerrit.wikimedia.org/r/767734 (https://phabricator.wikimedia.org/T281217)
[10:16:12] (CR) Elukey: [C: +1] "Tested all new links, dashboards load correctly, LGTM!"
[alerts] - https://gerrit.wikimedia.org/r/767726 (https://phabricator.wikimedia.org/T302958) (owner: Filippo Giunchedi)
[10:16:15] (CR) Filippo Giunchedi: [C: +2] Change Grafana dashboard links to new format [puppet] - https://gerrit.wikimedia.org/r/767720 (https://phabricator.wikimedia.org/T302958) (owner: Filippo Giunchedi)
[10:16:22] godog: green light :)
[10:16:23] thank you elukey
[10:16:27] <3
[10:16:31] (CR) Filippo Giunchedi: [C: +2] Change Grafana dashboard links to new format [alerts] - https://gerrit.wikimedia.org/r/767726 (https://phabricator.wikimedia.org/T302958) (owner: Filippo Giunchedi)
[10:17:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T302950)', diff saved to https://phabricator.wikimedia.org/P21780 and previous config saved to /var/cache/conftool/dbconfig/20220303-101704-ladsgroup.json
[10:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:07] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950
[10:17:48] (CR) Filippo Giunchedi: logstash: add blackbox-exporter filter config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[10:18:00] (PS1) Volans: CHANGELOG: add changelogs for release v2.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/767735
[10:18:56] !log kubectl cordon kubernetes200[1-4] to avoid scheduling pods on nodes that will be decommed during the next weeks - T302208
[10:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:59] T302208: setup/install kubernetes20[1(89)|2(012)] - https://phabricator.wikimedia.org/T302208
[10:19:34] (PS2) Volans: CHANGELOG: add changelogs for release v2.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/767735
[10:20:05] (CR) Alexandros Kosiaris: [C: +2] rdb1012: Switch to use rdb1011 as master [puppet] -
https://gerrit.wikimedia.org/r/767730 (https://phabricator.wikimedia.org/T281217) (owner: Alexandros Kosiaris)
[10:21:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P21781 and previous config saved to /var/cache/conftool/dbconfig/20220303-102154-ladsgroup.json
[10:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:36] (CR) Elukey: [C: +1] rdb1005: Switch all usages to rdb1011 [puppet] - https://gerrit.wikimedia.org/r/767732 (https://phabricator.wikimedia.org/T281217) (owner: Alexandros Kosiaris)
[10:25:34] jouncebot: nowandnext
[10:25:34] No deployments scheduled for the next 0 hour(s) and 34 minute(s)
[10:25:34] In 0 hour(s) and 34 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T1100)
[10:25:42] coool
[10:26:01] (CR) Ladsgroup: [C: +2] rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767691 (https://phabricator.wikimedia.org/T255493) (owner: Ladsgroup)
[10:26:05] (CR) Ladsgroup: [C: +2] rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/767692 (https://phabricator.wikimedia.org/T255493) (owner: Ladsgroup)
[10:28:51] (CR) Jelto: gitlab: add ferm rules and fix listen_addresses for test instance (2 comments) [puppet] - https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[10:30:14] (CR) Ayounsi: [C: +2] Revert "Depool ulsfo" [dns] - https://gerrit.wikimedia.org/r/767105 (owner: Ayounsi)
[10:30:43] !log repool ulsfo
[10:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to
https://phabricator.wikimedia.org/P21782 and previous config saved to /var/cache/conftool/dbconfig/20220303-103209-ladsgroup.json
[10:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:42] (PS3) Volans: CHANGELOG: add changelogs for release v2.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/767735
[10:33:03] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v2.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/767735 (owner: Volans)
[10:36:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300992)', diff saved to https://phabricator.wikimedia.org/P21783 and previous config saved to /var/cache/conftool/dbconfig/20220303-103659-ladsgroup.json
[10:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:02] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[10:37:57] (CR) Alexandros Kosiaris: [C: +2] rdb1006: Switch to decom [puppet] - https://gerrit.wikimedia.org/r/767731 (https://phabricator.wikimedia.org/T281217) (owner: Alexandros Kosiaris)
[10:39:05] (PS1) Ladsgroup: Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767694
[10:39:11] (PS2) Ladsgroup: Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767694
[10:39:21] (CR) Ladsgroup: [V: +2 C: +2] Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/767694 (owner: Ladsgroup)
[10:39:35] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v2.1.0 [software/spicerack] - https://gerrit.wikimedia.org/r/767735 (owner: Volans)
[10:41:07] SRE, DC-Ops, Infrastructure-Foundations, netops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (ayounsi) https://gerrit.wikimedia.org/r/c/operations/puppet/+/764791 should
fix the issue. About hostname vs. FQDN is because the devices use LLDP...
[10:41:17] (PS4) Ayounsi: Adding more new LEAF switches from Eqiad rows E/F to monitoring [puppet] - https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[10:42:03] SRE, ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (JMeybohm)
[10:43:57] (PS1) Volans: Upstream release v2.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/767740
[10:44:27] (Merged) jenkins-bot: rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767691 (https://phabricator.wikimedia.org/T255493) (owner: Ladsgroup)
[10:44:33] (Merged) jenkins-bot: rdbms: Change getConnectionRef to return with getLazyConnectionRef [core] (wmf/1.38.0-wmf.23) - https://gerrit.wikimedia.org/r/767692 (https://phabricator.wikimedia.org/T255493) (owner: Ladsgroup)
[10:46:03] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/767740 (owner: Volans)
[10:46:19] (CR) Volans: [C: +2] Upstream release v2.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/767740 (owner: Volans)
[10:47:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P21784 and previous config saved to /var/cache/conftool/dbconfig/20220303-104713-ladsgroup.json
[10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:57] (PS1) Alexandros Kosiaris: Switch all usages of rdb1005 to rdb1011 [deployment-charts] - https://gerrit.wikimedia.org/r/767742 (https://phabricator.wikimedia.org/T281217)
[10:50:13] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.24/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Backport: [[gerrit:767691|rdbms: Change getConnectionRef to return with
getLazyConnectionRef (T255493)]] (duration: 00m 51s)
[10:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:16] T255493: Consider phasing out ILoadBalancer::getConnectionRef in favour of ILoadBalancer::getLazyConnectionRef - https://phabricator.wikimedia.org/T255493
[10:52:32] (CR) Sergio Gimeno: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[10:53:02] (Merged) jenkins-bot: Upstream release v2.1.0 [software/spicerack] (debian) - https://gerrit.wikimedia.org/r/767740 (owner: Volans)
[10:58:32] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.23/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Backport: [[gerrit:767692|rdbms: Change getConnectionRef to return with getLazyConnectionRef (T255493)]] (duration: 00m 50s)
[10:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:36] T255493: Consider phasing out ILoadBalancer::getConnectionRef in favour of ILoadBalancer::getLazyConnectionRef - https://phabricator.wikimedia.org/T255493
[10:59:25] SRE, DC-Ops, Infrastructure-Foundations, netops, observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (cmooney) @robh apologies for this, I was working on an improved version of the CR Arzhel lists above yesterday. But it should have occurred to me...
[11:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T1100).
[11:00:50] SRE, ops-codfw, DBA: db2147 SMART error - https://phabricator.wikimedia.org/T302951 (Kormat) p:Triage→High
[11:02:21] !log kormat@cumin1001 dbctl commit (dc=all): 'Start repooling db1126 to full weight', diff saved to https://phabricator.wikimedia.org/P21785 and previous config saved to /var/cache/conftool/dbconfig/20220303-110220-kormat.json
[11:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T302950)', diff saved to https://phabricator.wikimedia.org/P21786 and previous config saved to /var/cache/conftool/dbconfig/20220303-110224-ladsgroup.json
[11:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:27] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950
[11:02:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Repooling to 100% after incident', diff saved to https://phabricator.wikimedia.org/P21787 and previous config saved to /var/cache/conftool/dbconfig/20220303-110257-kormat.json
[11:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:08] (CR) Kosta Harlan: "CI failure is T302964" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[11:06:57] (CR) David Caro: [C: +1] "LGTM, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/767720 (https://phabricator.wikimedia.org/T302958) (owner: Filippo Giunchedi)
[11:18:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 75%: Repooling to 100% after incident', diff saved to https://phabricator.wikimedia.org/P21788 and previous config saved to /var/cache/conftool/dbconfig/20220303-111801-kormat.json
[11:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:06] (CR) Isabelle Hurbain-Palatin: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[11:25:10] (PS1) Jbond: P:idp::client: add Wmflib::HTTP::SameSite type [puppet] - https://gerrit.wikimedia.org/r/767744
[11:25:48] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34049/console" [puppet] - https://gerrit.wikimedia.org/r/767744 (owner: Jbond)
[11:29:03] (CR) Jbond: [V: +1 C: +2] P:idp::client: add Wmflib::HTTP::SameSite type [puppet] - https://gerrit.wikimedia.org/r/767744 (owner: Jbond)
[11:31:49] (PS11) Cathal Mooney: Add EVPN overlay loopback subnets to CR BGP policy to switches [homer/public] - https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758)
[11:31:52] (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (NOOP 18): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34050/console" [puppet] - https://gerrit.wikimedia.org/r/767744 (owner: Jbond)
[11:32:44] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:33:05] !log kormat@cumin1001 dbctl commit (dc=all): 'db1126 (re)pooling @ 100%: Repooling to 100% after incident', diff saved to https://phabricator.wikimedia.org/P21789 and previous config saved
to /var/cache/conftool/dbconfig/20220303-113304-kormat.json
[11:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:54] SRE, Product-Infrastructure-Team-Backlog, serviceops, Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (MSantos)
[11:45:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:46:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:48:42] (PS1) Jbond: C:apereo_cas: add documentation and clean up minor lint issues [puppet] - https://gerrit.wikimedia.org/r/767747
[11:49:18] !log uploaded spicerack_2.1.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia
[11:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:22] (Abandoned) Cathal Mooney: Adding includes for Netbox-generated zone files for eqiad evpn lb [dns] - https://gerrit.wikimedia.org/r/767562 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[11:52:10] (PS1) Cathal Mooney: Try 2 to add Netbox-generated zone files for eqiad evpn loopbacks [dns] - https://gerrit.wikimedia.org/r/767748 (https://phabricator.wikimedia.org/T299758)
[11:53:46] (CR) Hashar: "check codehealth" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[12:02:43] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/767748 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:08:52] (CR) Cathal Mooney: [C: +2] Try 2 to add Netbox-generated zone files for
eqiad evpn loopbacks [dns] - https://gerrit.wikimedia.org/r/767748 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:10:26] (PS1) Ladsgroup: db1149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767751 (https://phabricator.wikimedia.org/T302950)
[12:10:53] (CR) Ladsgroup: [V: +2 C: +2] db1149: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/767751 (https://phabricator.wikimedia.org/T302950) (owner: Ladsgroup)
[12:13:03] (CR) Hashar: "The SonarQube analyzes fails because "≥ 80.0% Coverage required" and it shows 0 coverage on the new code in SpecialCreateAccountCampaign.p" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[12:13:28] (PS2) Jbond: C:apereo_cas: add documentation and clean up minor lint issues [puppet] - https://gerrit.wikimedia.org/r/767747
[12:14:09] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34051/console" [puppet] - https://gerrit.wikimedia.org/r/767747 (owner: Jbond)
[12:14:27] (PS1) Giuseppe Lavagetto: deployment_server: add mediawiki on k8s releases repo [puppet] - https://gerrit.wikimedia.org/r/767756 (https://phabricator.wikimedia.org/T299648)
[12:14:41] (CR) Kosta Harlan: GLAM event: Update landing page content (1 comment) [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: Kosta Harlan)
[12:23:43] (CR) Ayounsi: [C: +1] "ship it!"
[homer/public] - https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[12:25:25] (PS1) Jbond: C:apereo_cas: add support for cas.tgc.same-site-policy [puppet] - https://gerrit.wikimedia.org/r/767765
[12:26:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34052/console" [puppet] - https://gerrit.wikimedia.org/r/767765 (owner: Jbond)
[12:30:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[12:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance
[12:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T302950)', diff saved to https://phabricator.wikimedia.org/P21790 and previous config saved to /var/cache/conftool/dbconfig/20220303-123030-ladsgroup.json
[12:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:33] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950
[12:34:31] (PS2) Jbond: C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - https://gerrit.wikimedia.org/r/767765
[12:35:26] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34053/console" [puppet] - https://gerrit.wikimedia.org/r/767765 (owner: Jbond)
[12:35:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1149.eqiad.wmnet with OS bullseye
[12:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:34] (CR) jerkins-bot: [V: -1] C:apereo_cas: add
support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [12:38:38] (03PS3) 10Jbond: C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 [12:39:21] (03PS4) 10Jbond: C:apereo_cas: cas.tgc.same-site-policy & cas.tgc.pin-to-session support [puppet] - 10https://gerrit.wikimedia.org/r/767765 [12:42:11] (03PS1) 10Cathal Mooney: Remove puppet subnet definitions for private subnets racke E4/F4 [puppet] - 10https://gerrit.wikimedia.org/r/767772 (https://phabricator.wikimedia.org/T299758) [12:42:51] (03CR) 10Cathal Mooney: "If you wouldn't mind having a look thanks." [puppet] - 10https://gerrit.wikimedia.org/r/767772 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [12:47:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1149.eqiad.wmnet with reason: host reimage [12:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:35] !log Upgrading Quibble on CI Jenkins jobs from 1.3.0 to 1.4.3 https://gerrit.wikimedia.org/r/c/integration/config/+/767749/ [12:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:00] (03PS1) 10Ayounsi: Icinga: add icons to Juniper devices [puppet] - 10https://gerrit.wikimedia.org/r/767773 [12:49:02] (03PS1) 10Ayounsi: Icinga: use parent switch shortname [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) [12:50:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1149.eqiad.wmnet with reason: host reimage [12:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:02] (03CR) 10jerkins-bot: [V: 04-1] Icinga: use parent switch shortname [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [12:52:27] 
(03PS2) 10Cathal Mooney: Add subnet definitions for new Analytics vlans to netops data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/767772 (https://phabricator.wikimedia.org/T299758) [12:59:11] (03PS7) 10Filippo Giunchedi: logstash: add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) [12:59:13] (03PS1) 10Filippo Giunchedi: prometheus: use single probes/service job [puppet] - 10https://gerrit.wikimedia.org/r/767775 (https://phabricator.wikimedia.org/T291946) [13:04:10] (03CR) 10David Caro: [C: 03+1] P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [13:04:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [13:05:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1149.eqiad.wmnet with OS bullseye [13:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:25] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use single probes/service job [puppet] - 10https://gerrit.wikimedia.org/r/767775 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:06:31] (03PS2) 10Filippo Giunchedi: prometheus: use single probes/service job [puppet] - 10https://gerrit.wikimedia.org/r/767775 (https://phabricator.wikimedia.org/T291946) [13:07:32] (03PS1) 10Filippo Giunchedi: sre: add probes cert expiration alert [alerts] - 10https://gerrit.wikimedia.org/r/767778 [13:08:19] (03PS5) 10Jbond: C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 [13:08:21] (03PS1) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 
10https://gerrit.wikimedia.org/r/767779 [13:08:23] (03PS1) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [13:08:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:apereo_cas: add documentation and clean up minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/767747 (owner: 10Jbond) [13:09:56] (03CR) 10jerkins-bot: [V: 04-1] C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:09:59] (03PS6) 10Jbond: C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 [13:10:09] (03PS2) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 [13:10:15] (03PS2) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [13:11:31] (03CR) 10jerkins-bot: [V: 04-1] C:apereo_cas: add support for cas.tgc.same-site-policy and cas.tgc.pin-to-session [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:11:55] (03CR) 10Filippo Giunchedi: [C: 03+2] sre: add probes cert expiration alert [alerts] - 10https://gerrit.wikimedia.org/r/767778 (owner: 10Filippo Giunchedi) [13:11:59] (03PS2) 10Filippo Giunchedi: sre: add probes cert expiration alert [alerts] - 10https://gerrit.wikimedia.org/r/767778 [13:12:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T302950)', diff saved to https://phabricator.wikimedia.org/P21791 and previous config saved to /var/cache/conftool/dbconfig/20220303-131223-ladsgroup.json [13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:27] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [13:16:11] (03PS3) 10Phuedx: Request high-entropy Sec-CH-UA* client hints [puppet] - 
10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) [13:16:21] (03CR) 10Phuedx: Request high-entropy Sec-CH-UA* client hints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [13:17:56] (03CR) 10Emil Chetty: [V: 03+1] Request high-entropy Sec-CH-UA* client hints [puppet] - 10https://gerrit.wikimedia.org/r/765485 (https://phabricator.wikimedia.org/T301238) (owner: 10Phuedx) [13:20:33] !log restarting FPM/Apache on mw app servers to pick up expat security updates [13:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:32] (03CR) 10David Caro: [C: 03+2] P:wmcs::prometheus: deploy alert rule from ops/alerts.git [puppet] - 10https://gerrit.wikimedia.org/r/765567 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [13:28:42] (03PS1) 10Filippo Giunchedi: prometheus: tag team for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/767784 [13:31:13] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34055/console" [puppet] - 10https://gerrit.wikimedia.org/r/767773 (owner: 10Ayounsi) [13:31:20] (03CR) 10Cathal Mooney: [C: 03+2] Add EVPN overlay loopback subnets to CR BGP policy to switches [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:31:43] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767773 (owner: 10Ayounsi) [13:32:08] (03Merged) 10jenkins-bot: Add EVPN overlay loopback subnets to CR BGP policy to switches [homer/public] - 10https://gerrit.wikimedia.org/r/767570 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [13:32:39] (03PS1) 10Jbond: R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 [13:32:48] (03PS7) 10Muehlenhoff: 
C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:33:06] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:33:19] (03CR) 10jerkins-bot: [V: 04-1] R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [13:33:33] (03PS1) 10Alexandros Kosiaris: scap: Switch mw1306 to mw1318 for scap proxy role [puppet] - 10https://gerrit.wikimedia.org/r/767787 (https://phabricator.wikimedia.org/T273915) [13:33:37] (03PS1) 10Alexandros Kosiaris: mw130[2-6]: Remove and decomission [puppet] - 10https://gerrit.wikimedia.org/r/767788 (https://phabricator.wikimedia.org/T273915) [13:34:21] (03PS2) 10Jbond: R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 [13:35:10] (03CR) 10Jbond: "LGTM but see comment" [puppet] - 10https://gerrit.wikimedia.org/r/767773 (owner: 10Ayounsi) [13:35:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] Switch all usages of rdb1005 to rdb1011 [deployment-charts] - 10https://gerrit.wikimedia.org/r/767742 (https://phabricator.wikimedia.org/T281217) (owner: 10Alexandros Kosiaris) [13:35:22] (03CR) 10jerkins-bot: [V: 04-1] R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [13:35:38] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM anyway, that nasty butler is another story 😊" [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [13:36:09] (03PS15) 10Btullis: Add a set of charts for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [13:36:36] (03CR) 10jerkins-bot: [V: 04-1] Add a set of charts for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:36:43] (03PS3) 10Jbond: R:monitoring::host: clean up 
image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 [13:36:57] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: tag team for netbox reports [puppet] - 10https://gerrit.wikimedia.org/r/767784 (owner: 10Filippo Giunchedi) [13:37:30] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:38:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] rdb1005: Switch all usages to rdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/767732 (https://phabricator.wikimedia.org/T281217) (owner: 10Alexandros Kosiaris) [13:38:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/767732 (https://phabricator.wikimedia.org/T281217) (owner: 10Alexandros Kosiaris) [13:38:26] (03CR) 10jerkins-bot: [V: 04-1] R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [13:39:45] PROBLEM - SSH on dns5001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:40:03] (03Merged) 10jenkins-bot: Switch all usages of rdb1005 to rdb1011 [deployment-charts] - 10https://gerrit.wikimedia.org/r/767742 (https://phabricator.wikimedia.org/T281217) (owner: 10Alexandros Kosiaris) [13:42:29] (03CR) 10Majavah: [C: 03+1] "nit inside, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:42:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P21793 and previous config saved to /var/cache/conftool/dbconfig/20220303-134232-ladsgroup.json [13:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:49] (03PS1) 10Volans: redfish: fix default value [software/spicerack] - 10https://gerrit.wikimedia.org/r/767789 [13:44:07] (03CR) 10Majavah: [C: 03+1] 
wmcs-cinder-backup-manager: increase individual timeout to 30h [puppet] - 10https://gerrit.wikimedia.org/r/767474 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:44:41] !log akosiaris@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [13:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:43] !log roll restart ores uwsgi and celery for rdb1005 decommissioning. T281217 [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:46] T281217: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 [13:47:04] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) a:03Joe [13:47:16] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) p:05Triage→03Medium [13:47:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [13:48:05] (03PS1) 10Majavah: alertmanager: add basic wmcs routing rules [puppet] - 10https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) [13:48:47] (03CR) 10Jbond: [C: 03+1] "LGTM but see nit from Majavah" [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:49:50] (03CR) 10Ayounsi: Icinga: add icons to Juniper devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767773 (owner: 10Ayounsi) [13:51:22] (03PS2) 10David Caro: wmcs: add runbook url to the backup_cinder_volumes alert [puppet] - 10https://gerrit.wikimedia.org/r/767467 
(https://phabricator.wikimedia.org/T302855) [13:51:24] (03CR) 10David Caro: wmcs: add runbook url to the backup_cinder_volumes alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:51:54] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [13:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [13:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:10] (03CR) 10Muehlenhoff: "Looks good, two nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [13:52:20] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [13:52:21] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:25] (03PS2) 10David Caro: wmcs-cinder-backup-manager: increase individual timeout to 30h [puppet] - 10https://gerrit.wikimedia.org/r/767474 (https://phabricator.wikimedia.org/T302855) [13:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:30] (03CR) 10CDanis: [C: 03+1] wmcs: add runbook url to the backup_cinder_volumes alert [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:52:53] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [13:52:54] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:52:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:02] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:03] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [13:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:16] (03PS1) 10Ladsgroup: Revert "db1149: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767704 [13:53:22] (03PS2) 10Ladsgroup: Revert "db1149: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767704 [13:53:30] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [13:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:31] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:41] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1149: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/767704 (owner: 10Ladsgroup) [13:54:03] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo IRC channel access" [puppet] - 10https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [13:54:06] !log switch changeprop, changeprop-jobqueue to use rdb1011. 
T281217 [13:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:09] T281217: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 [13:54:35] (03CR) 10Jbond: Icinga: use parent switch shortname (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [13:54:36] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:06] (03CR) 10David Caro: [C: 03+2] wmcs: add runbook url to the backup_cinder_volumes alert [puppet] - 10https://gerrit.wikimedia.org/r/767467 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:55:10] (03CR) 10David Caro: [C: 03+2] wmcs-cinder-backup-manager: increase individual timeout to 30h [puppet] - 10https://gerrit.wikimedia.org/r/767474 (https://phabricator.wikimedia.org/T302855) (owner: 10David Caro) [13:57:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T302950)', diff saved to https://phabricator.wikimedia.org/P21794 and previous config saved to /var/cache/conftool/dbconfig/20220303-135737-ladsgroup.json [13:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:40] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [13:57:43] 10SRE, 10Traffic, 10WMF-General-or-Unknown: Failure to produce an image at specified resolution - https://phabricator.wikimedia.org/T302979 (10Zabe) [13:58:10] 10SRE, 10Thumbor, 10Traffic, 10WMF-General-or-Unknown: Failure to produce an image at specified resolution - https://phabricator.wikimedia.org/T302979 (10Zabe) [13:59:15] (03CR) 10David Caro: ""Someone with ChanServ founder access "" [puppet] - 10https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [13:59:21] (03CR) 10David 
Caro: [C: 03+1] alertmanager: add basic wmcs routing rules [puppet] - 10https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [14:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T1400). [14:00:05] sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:18] Here I am [14:00:20] o/ [14:00:24] o/ [14:00:25] (03PS2) 10Ayounsi: Icinga: use parent switch shortname [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) [14:01:23] I sure want that sticker [14:02:01] Lucas_WMDE: are you deploying or should I? [14:02:13] I’m trying to see if I can remember / figure out how a full scap goes [14:02:19] since I assume we’d need that due to the en.json [14:02:38] oh right, it indeed does contain translation changes [14:02:54] yes [14:03:06] why exactly are we backporting this? [14:03:21] (03CR) 10Ayounsi: [C: 03+2] Icinga: add icons to Juniper devices [puppet] - 10https://gerrit.wikimedia.org/r/767773 (owner: 10Ayounsi) [14:03:27] It's for an event scheduled for next Monday in Argentina [14:03:56] Part of broader LATM campaigns that will begin next month [14:03:58] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. 
[14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:35] !log upgraded spicerack to v2.1.0 on cumin1001/cumin2002 [14:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:24] hmm [14:06:00] * Lucas_WMDE reads through https://wm-bot.wmcloud.org/browser/index.php?start=01%2F19%2F2022&end=01%2F19%2F2022&display=%23wikimedia-operations where someone explained scap sync-world to me ^^ [14:06:06] I haven't done a translation backport previously and don't want to be doing one now without someone experienced with them around [14:07:45] I think I feel okay deploying this [14:07:45] sure, I understand. Let me see if urbanecm is around. Otherwise I can try to backport it in the late window [14:08:11] it’s a somewhat strange change imho, but since it was already +2ed on master, I think it’s okay [14:08:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] GLAM event: Update landing page content [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: 10Kosta Harlan) [14:08:43] it's a no-op change until it reaches group2 since it will only be available for eswiki [14:08:58] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34056/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [14:09:22] Lucas_WMDE: if you feel okay about it let's do it! [14:09:44] sergi0: does that mean it can’t be tested on mwdebug either? [14:10:25] ~ 14 min ci + ~ 20 min sync. fun. [14:10:39] right [14:10:41] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10MatthewVernon) Hi folks - I think @cmooney 's testing is blocked on a new cage being ready - is there a phab ticket for that (that this ticket could be linke... 
[14:10:49] good thing there’s nothing else in the window [14:11:03] a full scap is much faster when you're only touching one extension and one lang file inside it [14:13:23] (03Abandoned) 10David Caro: tools-clush-generator: add the shorter webgrid names [puppet] - 10https://gerrit.wikimedia.org/r/731114 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [14:14:30] (03PS4) 10Jbond: R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 [14:14:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [14:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:11] (03CR) 10jerkins-bot: [V: 04-1] R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [14:17:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:17] (03PS5) 10Jbond: R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 [14:24:50] (03CR) 10Jbond: [C: 03+1] redfish: fix default value [software/spicerack] - 10https://gerrit.wikimedia.org/r/767789 (owner: 10Volans) [14:25:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [14:25:25] (03CR) 10Volans: [C: 03+2] redfish: fix default value [software/spicerack] - 10https://gerrit.wikimedia.org/r/767789 (owner: 10Volans) [14:25:53] (03CR) 10Ayounsi: [C: 03+2] Icinga: use parent switch shortname [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) (owner: 10Ayounsi) [14:26:08] !log merge Icinga: use parent switch shortname [14:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:13] (03CR) 10Muehlenhoff: 
[C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [14:26:30] (03PS3) 10Ayounsi: Icinga: use parent switch shortname [puppet] - 10https://gerrit.wikimedia.org/r/767774 (https://phabricator.wikimedia.org/T302940) [14:27:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34058/console" [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [14:28:47] (03Merged) 10jenkins-bot: GLAM event: Update landing page content [extensions/GrowthExperiments] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767690 (https://phabricator.wikimedia.org/T301097) (owner: 10Kosta Harlan) [14:30:35] alright, the change is on mwdebug1001 [14:30:50] sergi0: can you check that the wiki doesn’t break, at least? [14:31:01] (I’m in a meeting, unfortunately, so I only have half attention right now) [14:31:09] yes. Let me create a couple of accounts [14:31:42] (03Merged) 10jenkins-bot: redfish: fix default value [software/spicerack] - 10https://gerrit.wikimedia.org/r/767789 (owner: 10Volans) [14:34:55] Lucas_WMDE: All seems fine, from the few I can test [14:35:09] ok thanks [14:37:10] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport: [[gerrit:767690|GLAM event: Update landing page content (T301097)]] (full sync because of i18n change) [14:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:14] T301097: Account creation: GLAM event landing page - https://phabricator.wikimedia.org/T301097 [14:37:16] alright, let’s go… [14:37:34] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:38:25] XioNoX: ^^ icinga errors, probably related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/767774/? 
[14:38:45] thanks taavi, on it [14:38:52] maybe it just needs a full puppet run everywhere for the exported resources to be updated? [14:39:06] yeah that's exactly it [14:39:08] taavi: yeah exactly [14:39:35] (03Abandoned) 10Jbond: R:monitoring::host: clean up image selection [puppet] - 10https://gerrit.wikimedia.org/r/767786 (owner: 10Jbond) [14:39:38] * godog shakes fist in exported resources' general direction [14:41:06] RECOVERY - SSH on dns5001.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:41:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767608 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [14:42:30] (03PS1) 10Milimetric: role::common::aqs: update mw history in both places [puppet] - 10https://gerrit.wikimedia.org/r/767792 [14:44:33] (03CR) 10Ottomata: [C: 03+2] role::common::aqs: update mw history in both places [puppet] - 10https://gerrit.wikimedia.org/r/767792 (owner: 10Milimetric) [14:45:58] (03PS1) 10Ottomata: Not yet in druid. [puppet] - 10https://gerrit.wikimedia.org/r/767808 [14:46:04] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Not yet in druid. [puppet] - 10https://gerrit.wikimedia.org/r/767808 (owner: 10Ottomata) [14:46:29] (03PS1) 10Ottomata: Should be merged after dataset in druid. 
[puppet] - 10https://gerrit.wikimedia.org/r/767809 [14:46:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport: [[gerrit:767690|GLAM event: Update landing page content (T301097)]] (full sync because of i18n change) (duration: 09m 45s) [14:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:01] T301097: Account creation: GLAM event landing page - https://phabricator.wikimedia.org/T301097 [14:48:09] !log UTC afternoon backport window done [14:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:11] (03PS8) 10Jbond: C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 [14:48:17] (03PS3) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 [14:48:30] (03PS4) 10Jbond: O:idp_test: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767779 [14:48:39] (03PS3) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [14:48:43] !log force a puppet run on cp6011 to unblock icinga and disable puppet again, cc bblack [14:48:43] (03CR) 10jerkins-bot: [V: 04-1] C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [14:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:51] (03PS4) 10Jbond: O:idp: update same site policy and disale pin to session [puppet] - 10https://gerrit.wikimedia.org/r/767780 [14:49:06] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for STHart - https://phabricator.wikimedia.org/T302929 (10JMeybohm) 05Open→03Resolved a:03JMeybohm [14:49:17] Lucas_WMDE: Thank you for the deploy! [14:49:22] np :) [14:49:42] sorry about the i18n change, not sure why that was required.. I will check with the team to avoid it in the future. ty! 
[14:59:16] (03CR) 10BryanDavis: [C: 03+1] alertmanager: add basic wmcs routing rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767790 (https://phabricator.wikimedia.org/T302493) (owner: 10Majavah) [14:59:22] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:08:12] !log T296022 - phabricator - disabled git cloning over ssh for 'stewardscripts' repo - stewards have been asked via mailing list [15:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:16] T296022: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 [15:08:42] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10LSobanski) @MoritzMuehlenhoff both DB and Backup tooling work is completed so at this point we are ready to go ahead and upgrade the Cumin hosts. [15:08:57] (03PS1) 10Gerrit maintenance bot: db1148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767795 (https://phabricator.wikimedia.org/T302950) [15:12:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: (Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) p:05Medium→03High @nskaggs, These hosts were ordered without a fully filed racking task (I meant to do it before order),... 
[15:12:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: (Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) [15:14:44] PROBLEM - SSH on aqs1009.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:14:45] (03CR) 10Ladsgroup: [C: 03+2] db1148: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767795 (https://phabricator.wikimedia.org/T302950) (owner: 10Gerrit maintenance bot) [15:15:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) [15:18:07] (03PS1) 10Gerrit maintenance bot: db1147: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/767797 (https://phabricator.wikimedia.org/T302950) [15:18:18] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/767772 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:20:48] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10MoritzMuehlenhoff) Ack, thanks! We'll probably go ahead with this next week, first by removing cumin2001 and then reimaging cumin1001. 
[15:21:09] !log restarting FPM/Apache on mw job runners to pick up expat security updates [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:15] (03CR) 10Cathal Mooney: [C: 03+2] Add subnet definitions for new Analytics vlans to netops data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/767772 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [15:22:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T302950)', diff saved to https://phabricator.wikimedia.org/P21798 and previous config saved to /var/cache/conftool/dbconfig/20220303-152242-ladsgroup.json [15:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [15:24:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10RobH) [15:26:04] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Jclark-ctr) name rack Unit Port CableID ms-be1068 e1 1U 1 20220289 ms-be1069 e2 1U 1 20220290 ms-be1070 
f1 1U 1 20220279 ms-be1071 f2 1U 1 20220280 [15:31:53] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [15:32:10] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install ms-be10[68-71] - https://phabricator.wikimedia.org/T299462 (10Jclark-ctr) [15:36:01] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Ladsgroup) FWIW, I asked for this when I was not SRE. One complicating factor is that netbox contains serial number of hardware we have a... [15:44:12] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10netbox: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (10Dzahn) Are hardware serial numbers more abusable / serious than other things we give NDAed people, like logstash, piwik and the other thi... [15:45:03] (03PS8) 10Filippo Giunchedi: logstash: add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) [15:46:02] (03CR) 10Cwhite: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:46:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1148.eqiad.wmnet with OS bullseye [15:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:55] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:44] (03CR) 10Filippo Giunchedi: [C: 03+2] logstash: add blackbox-exporter filter config [puppet] - 10https://gerrit.wikimedia.org/r/765476 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [15:53:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1148.eqiad.wmnet with reason: host reimage [15:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:19] 10SRE, 10Observability-Metrics, 10Traffic: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere) [16:01:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1148.eqiad.wmnet with reason: host reimage [16:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:48] (03CR) 10Muehlenhoff: "No need to duplicate the sync definition, gitlab-runner only has minimal unversioned deps (git, curl, tar, ca-certificates), so you can si" [puppet] - 10https://gerrit.wikimedia.org/r/767604 (https://phabricator.wikimedia.org/T297659) (owner: 10Dzahn) [16:08:59] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [16:10:44] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:11:41] (03PS1) 10Ottomata: Bump changelong for including latest workflow_utils [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/767830 [16:15:42] RECOVERY - SSH on aqs1009.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1148.eqiad.wmnet with OS bullseye [16:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:39] (03PS2) 10Muehlenhoff: envoy-hot-restart: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/767536 [16:20:25] (03PS1) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [16:23:25] (03PS11) 10Filippo Giunchedi: Introduce 'alertmanager' and 'alerting' modules [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) [16:24:50] (03CR) 10Filippo Giunchedi: "PTAL" [software/spicerack] - 10https://gerrit.wikimedia.org/r/765480 (https://phabricator.wikimedia.org/T293209) (owner: 10Filippo Giunchedi) [16:26:46] (03PS1) 10Hnowlan: Revert "api-gateway: allow discovery services to set custom rate limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/767810 [16:27:19] (03PS2) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [16:27:27] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for uwsgi-netbox-scriptproxy [puppet] - 10https://gerrit.wikimedia.org/r/767834 (https://phabricator.wikimedia.org/T135991) [16:27:40] (03Abandoned) 10Alexandros Kosiaris: Have rdb1012 replicate from rdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/682893 (https://phabricator.wikimedia.org/T281217) (owner: 10Legoktm) 
[16:28:29] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for uwsgi-netbox-scriptproxy [puppet] - 10https://gerrit.wikimedia.org/r/767834 (https://phabricator.wikimedia.org/T135991) [16:30:57] !log roll-restart logstash to pick up config changes - T291946 [16:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:01] T291946: Move service::catalog checks (“monitoring” section) to blackbox exporter and Alertmanager - https://phabricator.wikimedia.org/T291946 [16:32:06] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:21] 10SRE, 10ops-codfw, 10decommission-hardware: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577 (10Papaul) [16:35:24] (03CR) 10Muehlenhoff: C:apereo_cas: make session configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [16:35:57] 10SRE, 10ops-codfw, 10decommission-hardware: decommission ganeti2007 - https://phabricator.wikimedia.org/T302577 (10Papaul) 05Open→03Resolved complete [16:37:27] (03PS2) 10Hnowlan: Revert "api-gateway: allow discovery services to set custom rate limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/767810 [16:37:42] 10SRE, 10ops-codfw, 10decommission-hardware: decommission ganeti2008 - https://phabricator.wikimedia.org/T302578 (10Papaul) [16:38:51] (03PS1) 10Cathal Mooney: Add new eqiad switches to monitoring and align for all L3 switches [puppet] - 10https://gerrit.wikimedia.org/r/767835 (https://phabricator.wikimedia.org/T302940) [16:40:12] (03PS2) 10Ottomata: Bump changelong for including latest workflow_utils [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/767830 [16:41:34] (03PS1) 10Cwhite: logstash: re-enable service restart on config changes [puppet] - 10https://gerrit.wikimedia.org/r/767836 [16:42:22] (03PS9) 10Jbond: C:apereo_cas: make session configurable [puppet] - 
10https://gerrit.wikimedia.org/r/767765 [16:42:47] (03PS3) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [16:42:53] (03CR) 10jerkins-bot: [V: 04-1] C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [16:43:07] (03CR) 10Jbond: C:apereo_cas: make session configurable (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [16:44:15] (03PS10) 10Jbond: C:apereo_cas: make session configurable [puppet] - 10https://gerrit.wikimedia.org/r/767765 [16:44:19] (03CR) 10Hnowlan: [C: 03+2] Revert "api-gateway: allow discovery services to set custom rate limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/767810 (owner: 10Hnowlan) [16:44:28] (03CR) 10Herron: [C: 03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767836 (owner: 10Cwhite) [16:44:34] (03PS1) 10Ayounsi: Icinga: Add Juniper image to IPv6 items [puppet] - 10https://gerrit.wikimedia.org/r/767838 [16:44:36] (03PS1) 10Ayounsi: Icinga: refactor network monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767839 [16:44:53] (03PS4) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [16:48:19] (03CR) 10RLazarus: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/767536 (owner: 10Muehlenhoff) [16:48:50] (03Merged) 10jenkins-bot: Revert "api-gateway: allow discovery services to set custom rate limits" [deployment-charts] - 10https://gerrit.wikimedia.org/r/767810 (owner: 10Hnowlan) [16:49:14] RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T302950)', diff saved to https://phabricator.wikimedia.org/P21799 and previous config saved to /var/cache/conftool/dbconfig/20220303-165116-ladsgroup.json [16:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:20] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [16:52:58] (03PS5) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [16:53:33] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [16:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:53] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [16:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:50] (03CR) 10Ayounsi: [C: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34064/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/767839 (owner: 10Ayounsi) [17:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:58] jouncebot: s/PHP/Puppet declarative language/ ? 
;) [17:03:20] we have all kinds of hammers and all kinds of thumbs [17:03:37] wonder if puppet request window is still used much [17:03:45] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [17:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:56] it is! more than it used to be I think [17:04:09] oh! ok [17:04:37] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [17:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:50] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [17:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21800 and previous config saved to /var/cache/conftool/dbconfig/20220303-170621-ladsgroup.json [17:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:41] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [17:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:26] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:15:05] (03CR) 10Jbond: [C: 03+1] envoy-hot-restart: Switch shebang to /usr/bin/python3 [puppet] - 10https://gerrit.wikimedia.org/r/767536 (owner: 10Muehlenhoff) [17:16:32] (03CR) 10Ayounsi: [C: 04-1] "Replaced by new CRs" [puppet] - 10https://gerrit.wikimedia.org/r/764791 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:21:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P21801 and previous config saved to 
/var/cache/conftool/dbconfig/20220303-172125-ladsgroup.json [17:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:06] (03PS1) 10Jcrespo: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 [17:23:51] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Unrack wmf3570, wmf4579, conf1003 - https://phabricator.wikimedia.org/T302034 (10wiki_willy) [17:25:32] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10cmooney) @MatthewVernon hey! My apologies I was supposed to feed back before now. We should be good to go ms-fe1012 now, there are a few other servers rack... [17:25:37] (03PS2) 10Ayounsi: Icinga: refactor network monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767839 [17:26:00] (03PS2) 10Jcrespo: Refactor check_mariadb_backups.py and add enough tests for it [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 [17:26:34] (03CR) 10jerkins-bot: [V: 04-1] Icinga: refactor network monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767839 (owner: 10Ayounsi) [17:26:50] 10SRE, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10akosiaris) >>! In T274388#7744335, @MSantos wrote: >> Set up the traffic layer to send traffic to the service (if needed). This is a bit... 
[17:27:29] (03PS3) 10Ayounsi: Icinga: refactor network monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767839 [17:28:34] (03PS2) 10Tchanders: Enable IPInfo on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767216 (https://phabricator.wikimedia.org/T260598) [17:28:36] (03PS1) 10Tchanders: Autopromote-once users to the 'ipinfo' group after one edit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) [17:32:08] (03CR) 10Tchanders: Autopromote-once users to the 'ipinfo' group after one edit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767845 (https://phabricator.wikimedia.org/T296184) (owner: 10Tchanders) [17:32:51] (03CR) 10Jcrespo: "This tries to use wmfmariadbpy.dbutil.read_section_ports_list() but there are some issues there, see line from comment. Feedback welcome." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/767844 (owner: 10Jcrespo) [17:33:26] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Will abandon my one thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/767839 (owner: 10Ayounsi) [17:33:56] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/34066/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/767839 (owner: 10Ayounsi) [17:34:52] (03Abandoned) 10Cathal Mooney: Add new eqiad switches to monitoring and align for all L3 switches [puppet] - 10https://gerrit.wikimedia.org/r/767835 (https://phabricator.wikimedia.org/T302940) (owner: 10Cathal Mooney) [17:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T302950)', diff saved to https://phabricator.wikimedia.org/P21802 and previous config saved to /var/cache/conftool/dbconfig/20220303-173630-ladsgroup.json [17:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:34] T302950: Upgrade s4 to bullseye - https://phabricator.wikimedia.org/T302950 [17:36:53] (03CR) 10Ayounsi: [C: 03+2] Icinga: Add Juniper image to IPv6 items [puppet] - 10https://gerrit.wikimedia.org/r/767838 (owner: 10Ayounsi) [17:37:09] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Icinga: refactor network monitoring [puppet] - 10https://gerrit.wikimedia.org/r/767839 (owner: 10Ayounsi) [17:37:24] (03PS10) 10Krinkle: Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 (owner: 10Thiemo Kreuz (WMDE)) [17:38:19] merging another "big" icinga change [17:38:22] (03CR) 10Krinkle: [C: 03+2] Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 (owner: 10Thiemo Kreuz (WMDE)) [17:39:45] (03Merged) 10jenkins-bot: Make use of the ?? operator in some more situations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740305 (owner: 10Thiemo Kreuz (WMDE)) [17:43:26] (03CR) 10Ottomata: [C: 03+2] Should be merged after dataset in druid. 
[puppet] - 10https://gerrit.wikimedia.org/r/767809 (owner: 10Ottomata) [17:45:55] (03CR) 10Ryan Kemper: [C: 03+2] query_service: pass cookies on to blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/765667 (https://phabricator.wikimedia.org/T293462) (owner: 10Ebernhardson) [17:47:09] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [17:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [17:47:57] (03PS3) 10Ryan Kemper: query_service: Include scheme and host in X-redirect-url [puppet] - 10https://gerrit.wikimedia.org/r/767259 (owner: 10Ebernhardson) [17:48:24] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767834 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:48:38] !log otto@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [17:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:59] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: Include scheme and host in X-redirect-url [puppet] - 10https://gerrit.wikimedia.org/r/767259 (owner: 10Ebernhardson) [17:49:16] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [17:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:34] !log otto@cumin1001 END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[17:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:05] (03PS1) 10Ottomata: Rename cumin alias aqs-next to aqs [puppet] - 10https://gerrit.wikimedia.org/r/767853 (https://phabricator.wikimedia.org/T302278) [17:56:40] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Unrack wmf3570, wmf4579, conf1003, mw1301 - https://phabricator.wikimedia.org/T302034 (10wiki_willy) [17:57:24] (03CR) 10Elukey: [C: 03+1] Rename cumin alias aqs-next to aqs [puppet] - 10https://gerrit.wikimedia.org/r/767853 (https://phabricator.wikimedia.org/T302278) (owner: 10Ottomata) [17:58:17] !log krinkle@deploy1002 Synchronized wmf-config/: Idf7b21159423 (duration: 00m 51s) [17:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:27] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/767853 (https://phabricator.wikimedia.org/T302278) (owner: 10Ottomata) [17:58:36] (03CR) 10Ottomata: [C: 03+2] Rename cumin alias aqs-next to aqs [puppet] - 10https://gerrit.wikimedia.org/r/767853 (https://phabricator.wikimedia.org/T302278) (owner: 10Ottomata) [17:59:38] !log otto@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. 
[17:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:10] (03PS5) 10Ryan Kemper: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:00:51] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Wrong filenames in the File history section (timestamp differs from displayed timestamp) - https://phabricator.wikimedia.org/T302985 (10Aklapper) 05Stalled→03Open [18:02:08] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: updating wmf-puppet-dashboard [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:36] !log otto@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [18:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:56] (03CR) 10Ryan Kemper: [C: 03+2] search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:06:25] (03Merged) 10jenkins-bot: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:11:21] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: updating wmf-puppet-dashboard (duration: 09m 12s) [18:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:12] (03PS2) 10Ryan Kemper: wdqs/elastic: Remove icinga checks after moving to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:26:17] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:29:28] !log robh@cumin1001 START - Cookbook 
sre.dns.netbox [18:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] (03CR) 10Ryan Kemper: [C: 03+2] wdqs/elastic: Remove icinga checks after moving to alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/766834 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [18:32:45] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/767765 (owner: 10Jbond) [18:39:29] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:32] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye [18:42:44] (03PS1) 10Cathal Mooney: Add several ASNs to those that alert as critical from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) [18:45:05] (03PS1) 10Jdlrobson: Remove user navigation min width and width [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767811 (https://phabricator.wikimedia.org/T302753) [18:48:13] (03PS1) 10Brennen Bearnes: Unset data-toc in SkinVector [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) [18:49:15] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10herron) To take a step back, the varnish slo dashboard linked in the description didn't actually originate from a template. Presumably this one was a manual fork of the original etcd slo examp... 
[18:50:17] PROBLEM - Check systemd state on kubernetes2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:21] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [18:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1007.eqiad.wmnet with reason: host reimage [18:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:45] PROBLEM - Check systemd state on kubernetes1011 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:56:04] hi brennen [18:56:25] hey Jdlrobson [18:56:31] shall we go ahead and do that backport? [18:56:58] I need to revise that patch [18:57:02] so need a few more minutes [18:57:31] Jdlrobson: kk, take your time. [18:57:57] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:57:59] will hold train 'til then. [18:58:32] brennen: I'd also like to backport https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/767811 [18:58:44] although not UBN it looks pretty bad. [18:58:52] I'll use the window this afternoon if that's easier for you [19:00:04] brennen and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T1900). [19:01:35] Jdlrobson: let's not block on that one, but if it's ready before the backport window i can help you get it out. 
[19:02:30] it's ready now [19:02:40] (merged to master) [19:02:49] just needs backporting to wmf24. [19:03:17] Ideally it would go out before the train as going out at 1pm would have issues with cached HTML [19:03:24] the other patch should be ready shortly [19:03:25] kk, let's go ahead and do that then. [19:03:31] just finished my review with Nick [19:03:37] (nray ) [19:04:21] (03CR) 10Brennen Bearnes: [C: 03+2] Remove user navigation min width and width [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767811 (https://phabricator.wikimedia.org/T302753) (owner: 10Jdlrobson) [19:05:01] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dumpsdata1007.eqiad.wmnet with OS bullseye [19:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:04] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin1001 for host dumpsdata1007.eqiad.wmnet with OS bullseye completed: - dumpsdata1007 (**PASS**) - Removed from Puppet and PuppetDB if presen... [19:06:11] (03PS2) 10Jdlrobson: Unset data-toc in SkinVector [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) (owner: 10Brennen Bearnes) [19:06:27] (03CR) 10Jdlrobson: [C: 03+1] "Okay Brennen this one is also ready to go!" [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) (owner: 10Brennen Bearnes) [19:07:35] (03CR) 10Brennen Bearnes: [C: 03+2] Unset data-toc in SkinVector [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) (owner: 10Brennen Bearnes) [19:09:10] Jdlrobson, nray - ok, those are in queue; waiting on CI. i'll let you know when they're on a debug box. 
[19:11:01] brennen: sounds good [19:13:43] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic) [19:14:29] RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:20] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10RobH) Ok, this is now installed. However, I have a single raid1 of the 2 SSDS, but the megacli app doesn't see this? It can read controller info though, so its an inconsistent feedback. ` robh@dumpsdata1007:~$ sudo... [19:16:04] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) >>! In T302937#7751488, @RobH wrote: > Ok, this is now installed. However, I have a single raid1 of the 2 SSDS, but the megacli app doesn't see this? > > It can read controller info though, so it... [19:16:38] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10observability, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic) @elukey I updated the task description as I ask for it :) [19:16:51] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) a:05RobH→03MoritzMuehlenhoff Moritz, Can I get your feedback on the above perhaps? Is this due to older versions of megacli by chance or am I missing something about why this doesn't see the... 
[19:17:29] RECOVERY - Check systemd state on kubernetes2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:15] (03Merged) 10jenkins-bot: Remove user navigation min width and width [skins/MinervaNeue] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767811 (https://phabricator.wikimedia.org/T302753) (owner: 10Jdlrobson) [19:19:49] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/767811 is on mwdebug1002 [19:20:01] brennen: looking [19:20:50] (03CR) 10jerkins-bot: [V: 04-1] Unset data-toc in SkinVector [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) (owner: 10Brennen Bearnes) [19:21:02] (03CR) 10Ayounsi: "Thanks, lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/767862 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [19:21:40] Minerva patch is good to sync brennen [19:22:09] Vector CI issue is https://phabricator.wikimedia.org/T299780 [19:22:17] so we'll have to try that again [19:22:20] (03PS1) 10Jbond: C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 [19:22:52] Jdlrobson: ack, syncing minerva patch. 
[19:23:18] (03CR) 10jerkins-bot: [V: 04-1] C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 (owner: 10Jbond) [19:23:50] (03Merged) 10jenkins-bot: Unset data-toc in SkinVector [skins/Vector] (wmf/1.38.0-wmf.24) - 10https://gerrit.wikimedia.org/r/767812 (https://phabricator.wikimedia.org/T302461) (owner: 10Brennen Bearnes) [19:23:56] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.24/skins/MinervaNeue/resources/skins.minerva.base.styles/userMenu.less: Backport: [[gerrit:767811|Remove user navigation min width and width (T302753)]] (duration: 00m 51s) [19:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:00] T302753: [Regression, wmf.24-mobile] Search icon is misplaced - https://phabricator.wikimedia.org/T302753 [19:24:23] (03PS2) 10Jbond: C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 [19:24:59] (03CR) 10jerkins-bot: [V: 04-1] C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 (owner: 10Jbond) [19:25:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34068/console" [puppet] - 10https://gerrit.wikimedia.org/r/767869 (owner: 10Jbond) [19:26:21] (03PS3) 10Jbond: C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 [19:26:28] Jdlrobson: vector patch is on mwdebug1002 [19:26:45] brennen: looking [19:27:27] (03CR) 10Jbond: [C: 03+2] C:reposync: Add require on git clone command [puppet] - 10https://gerrit.wikimedia.org/r/767869 (owner: 10Jbond) [19:28:02] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:28:42] brennen: LGTM! 
[19:28:52] syncing [19:30:12] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.24/skins/Vector/includes/SkinVector.php: Backport: [[gerrit:767812|Unset data-toc in SkinVector (T302461)]] (duration: 00m 49s) [19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:16] T302461: Empty TOC in New Vector - https://phabricator.wikimedia.org/T302461 [19:32:05] !log 1.38.0-wmf.24 train (T300200): no current blockers; proceeding to all wikis [19:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:08] T300200: 1.38.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T300200 [19:32:53] (03PS6) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [19:32:58] (03PS1) 10Brennen Bearnes: all wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767871 [19:33:02] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767871 (owner: 10Brennen Bearnes) [19:33:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:51] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.24 refs T300200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767871 (owner: 10Brennen Bearnes) [19:34:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34069/console" [puppet] - 10https://gerrit.wikimedia.org/r/767832 (owner: 10Ssingh) [19:35:21] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.24 refs T300200 [19:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:53] thanks brennen! glad to see the train running again [19:36:09] Jdlrobson: thank you! 
appreciate the speedy assist. [19:36:30] now to wait for the inevitable group2 breakage... :) [19:38:36] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:02] hmkm [19:41:27] (03PS1) 10Jbond: C:reposync: update to use exec git init [puppet] - 10https://gerrit.wikimedia.org/r/767873 [19:42:04] (03CR) 10jerkins-bot: [V: 04-1] C:reposync: update to use exec git init [puppet] - 10https://gerrit.wikimedia.org/r/767873 (owner: 10Jbond) [19:42:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34070/console" [puppet] - 10https://gerrit.wikimedia.org/r/767873 (owner: 10Jbond) [19:42:20] Pinging mutante to see if I can help recover from that alert from deploy1002. I can't run `journalctl -u deploy_to_mwdebug` [19:43:54] dancy: not that much to see -- checking the named file now https://www.irccloud.com/pastebin/M1lqLEyH/ [19:44:21] 10SRE, 10DC-Ops: datadumps1007 test installs - https://phabricator.wikimedia.org/T302937 (10cmooney) [19:44:28] 10SRE, 10DC-Ops, 10Infrastructure-Foundations, 10netops, 10observability: icinga config error for new rows E/R - https://phabricator.wikimedia.org/T302940 (10cmooney) 05Open→03Resolved dumpsdata1007 looks good in Icinga now after being re-added, following the above patches being merged. Apologies fo... [19:44:46] dancy: hm, not much to see there either https://www.irccloud.com/pastebin/8sf9ByHe/ [19:45:02] rzl Thanks. There were no newer journal entries? [19:45:29] e.g. around 17:44:21 ? [19:45:32] oops sorry, I must have grabbed the wrong one -- there are, newer timestamps but same text [19:45:43] https://www.irccloud.com/pastebin/X05QpWiH/ [19:46:24] hm note that datestamp in /error is from a couple days ago [19:46:37] ah, indeed. 
[19:46:43] I'm not familiar, I guess this is saying it's been wedged since that time [19:47:01] Agreed. Can you remove /var/lib/deploy-mwdebug/error plz [19:47:02] aha yes and there's a longer log entry in the journal from Mar 01 17:44:21 [19:47:08] (03PS2) 10Jbond: C:reposync: update to use exec git init [puppet] - 10https://gerrit.wikimedia.org/r/767873 [19:47:39] dancy: done [19:48:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34071/console" [puppet] - 10https://gerrit.wikimedia.org/r/767873 (owner: 10Jbond) [19:48:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:46] and there it goes. [19:48:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:reposync: update to use exec git init [puppet] - 10https://gerrit.wikimedia.org/r/767873 (owner: 10Jbond) [19:49:30] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:00] Harmony restored. 
[19:50:04] Thanks rzl [19:50:40] \o/ [19:55:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:18] (03PS1) 10Jbond: C:reposync: update git init command [puppet] - 10https://gerrit.wikimedia.org/r/767875 [19:57:49] (03CR) 10Jbond: [C: 03+2] C:reposync: update git init command [puppet] - 10https://gerrit.wikimedia.org/r/767875 (owner: 10Jbond) [20:03:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:11] (03PS2) 10Krinkle: misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:14:30] (03CR) 10Krinkle: "Updated and added source link. It was in a repo of sorts and has broken/required changes since." [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:14:35] (03CR) 10Krinkle: [C: 03+1] misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:15:07] (03PS7) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [20:15:19] (03PS3) 10Krinkle: misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:15:22] (03CR) 10Krinkle: [C: 03+1] misc: search-grafana-dashboards.js [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:16:14] (03CR) 10Krinkle: [C: 03+1] "/me can't merge here, consider this a +2 about the code and whichever OSI-approved license you want. 
I notice both the repo and this direc" [software] - 10https://gerrit.wikimedia.org/r/767118 (owner: 10Filippo Giunchedi) [20:17:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34072/console" [puppet] - 10https://gerrit.wikimedia.org/r/767832 (owner: 10Ssingh) [20:17:48] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:18:09] (03PS8) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 [20:19:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34073/console" [puppet] - 10https://gerrit.wikimedia.org/r/767832 (owner: 10Ssingh) [20:21:36] (03PS1) 10Samtar: changeprop: Remove RESTBase page blacklist [deployment-charts] - 10https://gerrit.wikimedia.org/r/767878 (https://phabricator.wikimedia.org/T274359) [20:22:33] (03PS2) 10Ryan Kemper: opensearch: use separate rundir per instance [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [20:23:22] (03CR) 10Ryan Kemper: "Regarding the commit message:" [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [20:26:10] (03PS1) 10Krinkle: tests: Remove unused 'wmfDatacenter' var in cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767881 [20:26:23] (03CR) 10Krinkle: [C: 03+1] "LGTM, good to go." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:26:31] (03PS2) 10Krinkle: Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:26:57] (03CR) 10Krinkle: [C: 03+2] tests: Remove unused 'wmfDatacenter' var in cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767881 (owner: 10Krinkle) [20:27:38] (03Merged) 10jenkins-bot: tests: Remove unused 'wmfDatacenter' var in cirrusTest.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767881 (owner: 10Krinkle) [20:27:46] (03PS3) 10Krinkle: Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:27:51] (03CR) 10Krinkle: [C: 03+1] Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [20:30:16] (03PS1) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [20:31:50] (03PS1) 10Urbanecm: throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) [20:32:39] (03CR) 10jerkins-bot: [V: 04-1] throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) (owner: 10Urbanecm) [20:33:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:44] (03PS1) 10Urbanecm: throttle: Add rule for arwiki Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767885 
(https://phabricator.wikimedia.org/T303002) [20:35:05] (03CR) 10jerkins-bot: [V: 04-1] throttle: Add rule for arwiki Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T303002) (owner: 10Urbanecm) [20:37:11] (03PS2) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [20:37:27] (03PS1) 10Urbanecm: ThrottleTest: Cast strtotime to bool before comparing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767887 [20:37:42] (03PS2) 10Urbanecm: throttle: Add rule for Wikigap 2022 in CZ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767883 (https://phabricator.wikimedia.org/T303002) [20:37:49] (03PS2) 10Urbanecm: throttle: Add rule for arwiki Wikigap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/767885 (https://phabricator.wikimedia.org/T303002) [20:39:18] (03Abandoned) 10Ssingh: certspotter: fix certspotter's signal-to-noise ratio (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/767832 (owner: 10Ssingh) [20:40:05] (03PS3) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [20:41:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:06] (03PS1) 10JHathaway: profile::mirrors: move mirrors module into profiles [puppet] - 10https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) [20:44:33] (03PS2) 10JHathaway: profile::mirrors: move mirrors module into profiles [puppet] - 10https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) [20:45:08] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - 
degraded: The following units failed: wmf_auto_restart_exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:51] (03CR) 10JHathaway: "Would love to know your thoughts, still wrapping my head around the roles & profiles pattern" [puppet] - 10https://gerrit.wikimedia.org/r/767889 (https://phabricator.wikimedia.org/T300985) (owner: 10JHathaway) [20:47:26] (03PS3) 10Ryan Kemper: opensearch: use separate rundir per instance [puppet] - 10https://gerrit.wikimedia.org/r/767607 (https://phabricator.wikimedia.org/T276198) (owner: 10Cwhite) [20:48:09] (03PS4) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [20:48:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:26] (03PS5) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [20:53:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34080/console" [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [20:59:15] (03PS6) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [21:00:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34081/console" [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [21:00:04] brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport and config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220303T2100). [21:00:04] zabe: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:25] o/ [21:01:12] (03CR) 10Bking: [C: 03+2] Upgrade to elasticsearch 7.10.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/763485 (https://phabricator.wikimedia.org/T299226) (owner: 10EJoseph) [21:01:48] o/ [21:01:53] (03PS7) 10Jbond: P:netbox::automation: Add reposync with netbox-hiera bare repo [puppet] - 10https://gerrit.wikimedia.org/r/767882 [21:02:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34082/console" [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [21:05:13] (03CR) 10Jbond: [V: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/767882 (owner: 10Jbond) [21:05:29] zabe: we're discussing things a bit in the training session, rzl will be conducting your deploy here shortly. [21:06:01] ok :) [21:08:19] (03CR) 10RLazarus: [C: 03+2] "Backporting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:09:10] (03Merged) 10jenkins-bot: Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/766229 (https://phabricator.wikimedia.org/T45956) (owner: 10Zabe) [21:13:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:19:24] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:57] zabe: this is on mwdebug1001 [21:20:03] (just for clarity, it's back over to brennen after all because my internet's started flaking) [21:20:54] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:21:44] brennen, lgtm, there is nothing really I can test except checking that nothing breaks. logstash looks clear. [21:23:01] zabe: thx, syncing [21:25:30] zabe: pointers on sync order / dependencies here? MWRealm first? [21:26:52] brennen, sync order doesn't matter, we are only starting to write to a new variable but are not reading from it yet [21:27:01] should be a noop [21:28:31] !log brennen@deploy1002 Synchronized multiversion/MWRealm.php: Config: [[gerrit:766229|Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) (T45956)]] (duration: 00m 48s) [21:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:35] T45956: Rename $wmf* to $wmg* in wmf-config - https://phabricator.wikimedia.org/T45956 [21:28:53] !log brennen@deploy1002 Started scap: Config: [[gerrit:766229|Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) (T45956)]] [21:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:26] !log brennen@deploy1002 Finished scap: Config: [[gerrit:766229|Write the same value to $wmgDatacenter(s) as to $wmfDatacenter(s) (T45956)]] (duration: 01m 33s) [21:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:58] zabe: {{done}} [21:32:54] thx [21:35:40] !log end of UTC late backport & config window / training [21:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:36] (03CR) 10Ahmon Dancy: static.php: Improve docs and simplify/clarify some code (032 comments) [mediawiki-config] -
10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:47:55] (JobUnavailable) firing: Reduced availability for job jmx_wdqs_updater in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [21:49:32] (03PS4) 10Krinkle: static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) [21:50:13] (03CR) 10Ahmon Dancy: [C: 03+1] static.php: Improve docs and simplify/clarify some code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:50:16] (03CR) 10Krinkle: static.php: Improve docs and simplify/clarify some code (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/765355 (https://phabricator.wikimedia.org/T302465) (owner: 10Krinkle) [21:51:03] (03PS1) 10Bking: elastic: add elastic710 component to repos [puppet] - 10https://gerrit.wikimedia.org/r/767899 (https://phabricator.wikimedia.org/T299226) [21:55:47] (03PS2) 10Ryan Kemper: elastic: add elastic710 component [puppet] - 10https://gerrit.wikimedia.org/r/767899 (https://phabricator.wikimedia.org/T299226) (owner: 10Bking) [21:56:19] (03CR) 10Ryan Kemper: [V: 03+2] elastic: add elastic710 component [puppet] - 10https://gerrit.wikimedia.org/r/767899 (https://phabricator.wikimedia.org/T299226) (owner: 10Bking) [21:56:25] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: add elastic710 component [puppet] - 10https://gerrit.wikimedia.org/r/767899 (https://phabricator.wikimedia.org/T299226) (owner: 10Bking) [22:04:04] 10SRE-OnFire, 10Phabricator: Phabricator form request for creation of tasks tagged wikimedia-incident - https://phabricator.wikimedia.org/T303009 (10herron) [22:05:25] 10SRE-OnFire, 10Phabricator: Phabricator form request for creation of tasks tagged 
wikimedia-incident - https://phabricator.wikimedia.org/T303009 (10herron) [22:07:01] 10SRE-OnFire, 10Phabricator: Phabricator form request for creation of tasks tagged wikimedia-incident - https://phabricator.wikimedia.org/T303009 (10RhinosF1) I don't believe forms have granular edit permissions. If you want something you can edit easily on the fly, you can try phabulous.toolforge.org to gene... [22:10:28] herron: does ^ make sense [22:10:43] RhinosF1: it does thanks, experimenting with that tool now [22:11:51] (03PS1) 10JHathaway: profile::mirrros: switch to apache2 [puppet] - 10https://gerrit.wikimedia.org/r/767903 (https://phabricator.wikimedia.org/T300985) [22:12:37] 10SRE-OnFire, 10Phabricator: Phabricator form request for creation of tasks tagged wikimedia-incident - https://phabricator.wikimedia.org/T303009 (10herron) 05Open→03Resolved a:03herron Neat, thanks @RhinosF1 that should be good enough to get started with this [22:21:28] a heads up that i'm likely rolling back the train here in a moment. [22:22:36] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:39:05] 10SRE-OnFire, 10Phabricator: Phabricator form request for creation of tasks tagged wikimedia-incident - https://phabricator.wikimedia.org/T303009 (10Aklapper) 05Resolved→03Declined No form was created; correcting task status [22:42:36] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:55:28] (or not) [23:05:24] 10SRE, 10SRE Observability (FY2021/2022-Q3): SLO dashboard refinements - https://phabricator.wikimedia.org/T302842 (10RLazarus) > To take a step back, the varnish slo dashboard linked in the description didn't actually originate from a template. Presumably this one was a manual fork of the original etcd slo ex... 
[23:14:16] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook