[00:01:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P64136 and previous config saved to /var/cache/conftool/dbconfig/20240606-000151-marostegui.json [00:04:41] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:06:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [00:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P64137 and previous config saved to /var/cache/conftool/dbconfig/20240606-001700-marostegui.json [00:32:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T364299)', diff saved to https://phabricator.wikimedia.org/P64138 and previous config saved to /var/cache/conftool/dbconfig/20240606-003208-marostegui.json [00:32:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2210.codfw.wmnet with reason: Maintenance [00:32:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:32:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2210.codfw.wmnet with reason: Maintenance [00:32:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2210 
(T364299)', diff saved to https://phabricator.wikimedia.org/P64139 and previous config saved to /var/cache/conftool/dbconfig/20240606-003232-marostegui.json [00:36:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T364069)', diff saved to https://phabricator.wikimedia.org/P64140 and previous config saved to /var/cache/conftool/dbconfig/20240606-003620-marostegui.json [00:36:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:51:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P64141 and previous config saved to /var/cache/conftool/dbconfig/20240606-005128-marostegui.json [01:06:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P64142 and previous config saved to /var/cache/conftool/dbconfig/20240606-010636-marostegui.json [01:21:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T364069)', diff saved to https://phabricator.wikimedia.org/P64143 and previous config saved to /var/cache/conftool/dbconfig/20240606-012144-marostegui.json [01:21:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [01:21:48] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:22:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1244.eqiad.wmnet with reason: Maintenance [01:22:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1244 (T364069)', diff saved to https://phabricator.wikimedia.org/P64144 and previous config saved to /var/cache/conftool/dbconfig/20240606-012208-marostegui.json [01:25:29] (PS2) Anzx: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - https://gerrit.wikimedia.org/r/1037006 
(https://phabricator.wikimedia.org/T362494) [01:28:46] (CR) Anzx: "there are completed throttle exception for T365221 and T364708 which can also be removed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: Urbanecm) [01:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:11:41] (PS1) Pppery: Rescue libphutil translations (languages below export threshold) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) [02:13:17] (PS2) Pppery: Rescue libphutil translations (languages below old export threshold) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) [02:21:04] (PS3) Pppery: Rescue libphutil translations (languages below old export threshold) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) [02:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P64145 and previous config saved to /var/cache/conftool/dbconfig/20240606-024321-ladsgroup.json [02:43:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:58:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to 
https://phabricator.wikimedia.org/P64146 and previous config saved to /var/cache/conftool/dbconfig/20240606-025828-ladsgroup.json [02:58:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P64147 and previous config saved to /var/cache/conftool/dbconfig/20240606-030145-ladsgroup.json [03:01:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:06:25] RESOLVED: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:13:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P64148 and previous config saved to /var/cache/conftool/dbconfig/20240606-031336-ladsgroup.json [03:16:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P64149 and previous config saved to /var/cache/conftool/dbconfig/20240606-031653-ladsgroup.json [03:28:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P64150 and previous config saved to /var/cache/conftool/dbconfig/20240606-032844-ladsgroup.json [03:28:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [03:28:47] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [03:29:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [03:29:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P64151 and previous config saved to /var/cache/conftool/dbconfig/20240606-032907-ladsgroup.json [03:31:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T364299)', diff saved to https://phabricator.wikimedia.org/P64152 and previous config saved to /var/cache/conftool/dbconfig/20240606-033125-marostegui.json [03:31:37] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [03:32:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P64153 and previous config saved to /var/cache/conftool/dbconfig/20240606-033201-ladsgroup.json [03:43:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P64154 and previous config saved to /var/cache/conftool/dbconfig/20240606-034635-marostegui.json [03:47:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T352010)', diff saved to https://phabricator.wikimedia.org/P64155 and previous config saved to /var/cache/conftool/dbconfig/20240606-034709-ladsgroup.json [03:47:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:47:12] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [03:47:24] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [03:47:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T352010)', diff saved to https://phabricator.wikimedia.org/P64156 and previous config saved to /var/cache/conftool/dbconfig/20240606-034732-ladsgroup.json [04:01:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P64157 and previous config saved to /var/cache/conftool/dbconfig/20240606-040142-marostegui.json [04:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [04:16:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2210 (T364299)', diff saved to https://phabricator.wikimedia.org/P64158 and previous config saved to /var/cache/conftool/dbconfig/20240606-041650-marostegui.json [04:16:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2219.codfw.wmnet with reason: Maintenance [04:16:54] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:17:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2219.codfw.wmnet with reason: Maintenance [04:17:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T364299)', diff saved to https://phabricator.wikimedia.org/P64159 and previous config saved to /var/cache/conftool/dbconfig/20240606-041714-marostegui.json [04:40:45] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:02:15] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [05:03:25] FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:34] (PS23) Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [05:04:48] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [05:19:56] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [05:21:23] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [05:30:55] FIRING: [5x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9400.service on elastic2067:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:55] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:35:55] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:40:43] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [05:44:01] (CR) Giuseppe Lavagetto: [C:-1] mw-mcrouter: Switch helmfile.d to use the newer cache module (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1038860 (owner: Alexandros Kosiaris) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T0600) [06:00:04] kormat, marostegui, Amir1, and arnaudb: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T0600). 
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:25] (PS1) Andrea Denisse: conftool: Integrate logstash with active-passive configuration [puppet] - https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) [06:49:37] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1034.eqiad.wmnet [06:53:36] (PS1) Peter Fischer: Search update pipeline: increase fetch timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1039528 (https://phabricator.wikimedia.org/T366340) [06:55:13] (CR) Brouberol: [C:+1] "Assuming that all existing wikikube released have been uninstalled: +1!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1038773 (https://phabricator.wikimedia.org/T366338) (owner: Stevemunene) [06:55:41] (CR) Peter Fischer: [C:+2] Search update pipeline: increase fetch timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1039528 (https://phabricator.wikimedia.org/T366340) (owner: Peter Fischer) [06:56:10] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1034.eqiad.wmnet [06:56:35] (Merged) jenkins-bot: Search update pipeline: increase fetch timeout [deployment-charts] - https://gerrit.wikimedia.org/r/1039528 (https://phabricator.wikimedia.org/T366340) (owner: Peter Fischer) [07:00:04] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. [07:03:08] easy [07:05:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:05:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [07:05:55] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [07:05:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T352010)', diff saved to https://phabricator.wikimedia.org/P64160 and previous config saved to /var/cache/conftool/dbconfig/20240606-070558-ladsgroup.json [07:06:01] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:06:16] !log Restarting Gerrit [07:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:08:45] PROBLEM - Host gerrit1003 is DOWN: PING CRITICAL - Packet loss = 100% [07:09:31] FIRING: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:09:47] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:09:55] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:10:09] RECOVERY - Host gerrit1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [07:14:31] RESOLVED: [4x] ProbeDown: Service gerrit1003:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:15:57] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1039229 (https://phabricator.wikimedia.org/T273950) (owner: Muehlenhoff) [07:30:40] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [07:32:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T364299)', diff saved to https://phabricator.wikimedia.org/P64161 and previous config saved to 
/var/cache/conftool/dbconfig/20240606-073229-marostegui.json [07:32:33] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:39:55] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2062:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P64162 and previous config saved to /var/cache/conftool/dbconfig/20240606-074356-ladsgroup.json [07:44:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:47:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P64163 and previous config saved to /var/cache/conftool/dbconfig/20240606-074737-marostegui.json [07:48:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:20] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet [07:56:22] (CR) Hashar: [C:+1] "Looks good on the surface though I don't know much about Phabricator Conduit API." [mediawiki-config] - https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) (owner: BryanDavis) [07:57:40] (CR) Urbanecm: "Oh, it's not chronologically ordered... Done :)." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: Urbanecm) [07:57:48] (PS3) Ayounsi: Prepare for netbox-dev [puppet] - https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [07:57:48] (PS3) Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [07:58:06] (PS3) Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) [07:58:12] jouncebot: nowandnext [07:58:12] For the next 0 hour(s) and 1 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T0700) [07:58:12] In 2 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1000) [07:58:13] (CR) CI reject: [V:-1] Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [07:58:26] (PS4) Urbanecm: Add throttle exception for an upcoming workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) [07:58:29] (CR) Urbanecm: [C:+2] Add throttle exception for an upcoming workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: Urbanecm) [07:58:44] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: Urbanecm) [07:59:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P64164 and previous config saved to 
/var/cache/conftool/dbconfig/20240606-075904-ladsgroup.json [07:59:30] (Merged) jenkins-bot: Add throttle exception for an upcoming workshop [mediawiki-config] - https://gerrit.wikimedia.org/r/1039287 (https://phabricator.wikimedia.org/T366748) (owner: Urbanecm) [07:59:51] (PS2) Urbanecm: Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) (owner: Sergio Gimeno) [07:59:55] RESOLVED: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:26] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1039287|Add throttle exception for an upcoming workshop (T366748)]] [08:00:29] T366748: Request for a throttle exemption: workshop in Brno - https://phabricator.wikimedia.org/T366748 [08:01:02] (PS1) Santiago Faci: geo-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039560 (https://phabricator.wikimedia.org/T363011) [08:01:30] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1001.eqiad.wmnet [08:02:25] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet [08:02:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P64165 and previous config saved to /var/cache/conftool/dbconfig/20240606-080245-marostegui.json [08:03:17] (PS1) Santiago Faci: device-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039561 (https://phabricator.wikimedia.org/T363010) [08:04:37] (PS4) Ayounsi: Netbox 4: 
JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [08:04:44] (Abandoned) Hashar: Merge branch 'wmf/stable-3.8' into wmf/stable-3.9 [software/gerrit] (wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1039202 (https://phabricator.wikimedia.org/T364342) (owner: Hashar) [08:07:35] (CR) Stevemunene: [C:+2] Clean up datahub from main cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1038773 (https://phabricator.wikimedia.org/T366338) (owner: Stevemunene) [08:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:37] (CR) JMeybohm: kask: add mesh configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: Hnowlan) [08:09:47] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:10:36] (Merged) jenkins-bot: Clean up datahub from main cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1038773 (https://phabricator.wikimedia.org/T366338) (owner: Stevemunene) [08:13:53] (CR) Alexandros Kosiaris: "OK. Starting with 99% and will submit followups to lower it." 
[puppet] - https://gerrit.wikimedia.org/r/1039219 (owner: Alexandros Kosiaris) [08:13:56] (CR) Alexandros Kosiaris: [C:+2] mediawiki-image-download: Support pct based aborted runs [puppet] - https://gerrit.wikimedia.org/r/1039219 (owner: Alexandros Kosiaris) [08:14:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P64166 and previous config saved to /var/cache/conftool/dbconfig/20240606-081412-ladsgroup.json [08:14:17] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:14:18] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:15:55] (CR) Volans: "I don't think it's a good idea to have vulture running, it looks for unused code but without a test suite it will throw a lot of false pos" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1038869 (owner: Ayounsi) [08:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:16:18] (PS1) Alexandros Kosiaris: mediawiki-image-download.sh: Fix logic [puppet] - https://gerrit.wikimedia.org/r/1039606 [08:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T364299)', diff saved to https://phabricator.wikimedia.org/P64167 and previous config saved to /var/cache/conftool/dbconfig/20240606-081753-marostegui.json [08:17:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:18:35] jouncebot: now and next [08:18:35] No deployments scheduled for the next 1 hour(s) and 41 minute(s) [08:19:28] I'll go ahead with prometheus eqiad/codfw rolling reboots [08:19:37] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet 
[08:19:58] (CR) Alexandros Kosiaris: [C:+2] mediawiki-image-download.sh: Fix logic [puppet] - https://gerrit.wikimedia.org/r/1039606 (owner: Alexandros Kosiaris) [08:21:05] (CR) Volans: "Few comments inline" [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [08:21:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [08:22:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64168 and previous config saved to /var/cache/conftool/dbconfig/20240606-082204-arnaudb.json [08:22:46] (CR) Filippo Giunchedi: [C:+1] logstash: limit LogstashKafkaConsumerLag to Logstash-specific consumer groups [alerts] - https://gerrit.wikimedia.org/r/1037487 (https://phabricator.wikimedia.org/T366227) (owner: Cwhite) [08:23:08] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1002.eqiad.wmnet [08:23:11] (CR) Filippo Giunchedi: [C:+1] o11y: add BenthosKafkaConsumerLag alert [alerts] - https://gerrit.wikimedia.org/r/1037488 (https://phabricator.wikimedia.org/T366227) (owner: Cwhite) [08:23:35] (CR) Filippo Giunchedi: [C:+1] otelcol: filter out sessionstore user IDs [deployment-charts] - https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) (owner: CDanis) [08:23:35] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9866457 (ABran-WMF) In progress→Resolved host is repooling [08:23:49] (CR) Filippo Giunchedi: [C:+1] otelcol: filter common healthcheck spans [deployment-charts] - https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) (owner: CDanis) [08:23:59] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: 
Ayounsi) [08:28:35] (CR) Filippo Giunchedi: "I'm not sure renaming services like this is possible right now; I suggest we take a phased approach:" [puppet] - https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) (owner: Andrea Denisse) [08:28:40] (PS1) Hashar: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1039610 (https://phabricator.wikimedia.org/T354887) [08:28:49] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [08:29:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T352010)', diff saved to https://phabricator.wikimedia.org/P64169 and previous config saved to /var/cache/conftool/dbconfig/20240606-082920-ladsgroup.json [08:29:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:29:34] (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [08:29:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [08:29:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T352010)', diff saved to https://phabricator.wikimedia.org/P64170 and previous config saved to /var/cache/conftool/dbconfig/20240606-082943-ladsgroup.json [08:29:54] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [08:35:02] !log pfischer@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:35:20] (PS1) KartikMistry: CX: Fix translation container max width for large screens [extensions/ContentTranslation] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1039571
(https://phabricator.wikimedia.org/T366374) [08:35:26] !log pfischer@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:36:28] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [08:37:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 2%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64171 and previous config saved to /var/cache/conftool/dbconfig/20240606-083710-arnaudb.json [08:37:20] (PS1) GergesShamon: [mswiktionary] Change the default Sitename value to Wikikamus [mediawiki-config] - https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) [08:37:57] (CR) Sg912: [C:+1] "LGTM , Thank you!" [deployment-charts] - https://gerrit.wikimedia.org/r/1039560 (https://phabricator.wikimedia.org/T363011) (owner: Santiago Faci) [08:38:48] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [08:39:06] (CR) Sg912: [C:+1] "LGTM , Thanks!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1039561 (https://phabricator.wikimedia.org/T363010) (owner: Santiago Faci) [08:39:10] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [08:39:23] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [08:40:39] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2001.codfw.wmnet [08:43:29] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet [08:43:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:44:05] !log pfischer@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:44:20] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1031.eqiad.wmnet [08:47:06] (CR) Ayounsi: "Thanks !"
[software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [08:47:42] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1003.eqiad.wmnet [08:48:15] (PS1) Santiago Faci: Metrics Platform Instrument Configuration: Deploying v0.0.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1039614 (https://phabricator.wikimedia.org/T364177) [08:50:01] (PS2) Santiago Faci: Metrics Platform Instrument Configuration: Deploying v0.0.12 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1039614 (https://phabricator.wikimedia.org/T364177) [08:50:53] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1031.eqiad.wmnet [08:52:07] jouncebot: now and next [08:52:08] No deployments scheduled for the next 1 hour(s) and 7 minute(s) [08:52:15] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [08:52:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64172 and previous config saved to /var/cache/conftool/dbconfig/20240606-085216-arnaudb.json [08:52:19] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [08:54:32] (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying v0.0.12 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1039614 (https://phabricator.wikimedia.org/T364177) (owner: Santiago Faci) [08:54:41] (CR) Santiago Faci: [C:+2] device-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039561 (https://phabricator.wikimedia.org/T363010) (owner: Santiago Faci) [08:54:48] (CR) Santiago Faci: [C:+2] geo-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039560
(https://phabricator.wikimedia.org/T363011) (owner: Santiago Faci) [08:55:22] (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying v0.0.12 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1039614 (https://phabricator.wikimedia.org/T364177) (owner: Santiago Faci) [08:55:37] (Merged) jenkins-bot: device-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039561 (https://phabricator.wikimedia.org/T363010) (owner: Santiago Faci) [08:55:49] (Merged) jenkins-bot: geo-analytics: Documentation improvements [deployment-charts] - https://gerrit.wikimedia.org/r/1039560 (https://phabricator.wikimedia.org/T363011) (owner: Santiago Faci) [08:56:02] (CR) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [08:56:02] (CR) Ayounsi: Fix lots of CI errors (16 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/1038869 (owner: Ayounsi) [08:56:24] (PS17) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [08:56:33] (PS7) Ayounsi: Fix lots of CI errors [software/netbox-extras] - https://gerrit.wikimedia.org/r/1038869 [08:56:33] (PS19) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 [08:56:52] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [08:56:55] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet [08:57:09] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [09:00:30] (PS1) Stevemunene: Delete datahub
kubeconfigs on main [puppet] - https://gerrit.wikimedia.org/r/1039618 (https://phabricator.wikimedia.org/T366338) [09:01:09] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [09:01:52] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [09:01:56] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2002.codfw.wmnet [09:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T364069)', diff saved to https://phabricator.wikimedia.org/P64173 and previous config saved to /var/cache/conftool/dbconfig/20240606-090529-marostegui.json [09:05:33] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:06:15] (PS1) Alexandros Kosiaris: mediawiki-image-download: Drop to 90% [puppet] - https://gerrit.wikimedia.org/r/1039619 (https://phabricator.wikimedia.org/T366778) [09:06:17] (PS1) Alexandros Kosiaris: mediawiki-image-download: Drop to 80% [puppet] - https://gerrit.wikimedia.org/r/1039620 (https://phabricator.wikimedia.org/T366778) [09:06:18] (PS1) Alexandros Kosiaris: mediawiki-image-download: Drop to 75% [puppet] - https://gerrit.wikimedia.org/r/1039621 (https://phabricator.wikimedia.org/T366778) [09:06:20] (PS1) Alexandros Kosiaris: mediawiki-image-download: Drop to 66% [puppet] - https://gerrit.wikimedia.org/r/1039622 (https://phabricator.wikimedia.org/T366778) [09:06:21] (PS1) Alexandros Kosiaris: mediawiki-image-download: Drop to 50% [puppet] - https://gerrit.wikimedia.org/r/1039623 (https://phabricator.wikimedia.org/T366778) [09:07:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64174 and previous config saved to /var/cache/conftool/dbconfig/20240606-090722-arnaudb.json [09:08:08] !log
mvernon@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be1004.eqiad.wmnet [09:09:44] (CR) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:10:56] (CR) Alexandros Kosiaris: [C:+1] [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:11:36] !log stevemunene@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:12:38] !log stevemunene@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:12:45] (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (6 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:13:32] !log stevemunene@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:14:17] (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:15:15] !log stevemunene@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:16:04] (CR) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:16:05] I'm restarting the codfw k8s reboots [09:17:09] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:17:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [09:18:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet [09:19:18] (PS18) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:20:20] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:20:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P64175 and previous config saved to /var/cache/conftool/dbconfig/20240606-092037-marostegui.json [09:20:57] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:21:48] (PS2) Mhorsey: Activate campaignEvents extension on Igbo wiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) [09:22:16] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:22:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64176 and previous config saved to /var/cache/conftool/dbconfig/20240606-092228-arnaudb.json [09:24:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [09:24:35] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:24:37] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6 [09:24:37] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:25:39] (CR) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:26:15] (CR) Brouberol: [C:+1] "You will need to delete them by hand after having run puppet on the deploy hosts" [puppet] - https://gerrit.wikimedia.org/r/1039618 (https://phabricator.wikimedia.org/T366338) (owner: Stevemunene) [09:26:33] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:26:37] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:27:07] (CR) Kosta Harlan: [C:+1] [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:27:22] (CR) Kosta Harlan: [C:+1] [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:30:50] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2003.codfw.wmnet [09:33:34] (PS1) Hnowlan: mesh: publish mesh.configuration 1.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1039626 (https://phabricator.wikimedia.org/T362310) [09:33:50] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet [09:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:35:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244', diff saved to https://phabricator.wikimedia.org/P64177 and previous config saved to /var/cache/conftool/dbconfig/20240606-093545-marostegui.json [09:36:20] SRE, Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9866758 (Chqaz) @Dzahn What does it mean to reset to the original administrator from T340380?
[09:36:39] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [09:36:39] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:36:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [09:36:41] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:37:04] (PS1) Brouberol: Remove noisy monitor that brings no value [alerts] - https://gerrit.wikimedia.org/r/1039627 [09:37:23] (Abandoned) Brouberol: Remove noisy monitor that brings no value [alerts] - https://gerrit.wikimedia.org/r/1039627 (owner: Brouberol) [09:37:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64178 and previous config saved to /var/cache/conftool/dbconfig/20240606-093734-arnaudb.json [09:38:39] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:38:41] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:41:12]
(Restored) Brouberol: Remove noisy monitor that brings no value [alerts] - https://gerrit.wikimedia.org/r/1039627 (owner: Brouberol) [09:41:21] (PS2) Brouberol: Remove noisy monitor that brings no value [alerts] - https://gerrit.wikimedia.org/r/1039627 [09:46:20] (CR) Clément Goubert: "LGTM. If you want to test it on the mw-debug release of mw-on-k8s as well, you'll need a patch to `helmfile.d/services/mw-debug/values-{eq" [puppet] - https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [09:47:21] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host thanos-be2004.codfw.wmnet [09:47:47] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [09:47:47] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:48:18] (PS19) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [09:48:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [09:48:47] v6: Active - kubernetes-codfw
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:47] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:49] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1244 (T364069)', diff saved to https://phabricator.wikimedia.org/P64179 and previous config saved to /var/cache/conftool/dbconfig/20240606-095053-marostegui.json [09:50:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:50:57] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:51:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:52:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64180 and previous config saved to /var/cache/conftool/dbconfig/20240606-095240-arnaudb.json [09:58:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:59:51] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS [09:59:51] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:53] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [09:59:53] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1000) [10:01:51] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:01:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64181 and previous config saved to /var/cache/conftool/dbconfig/20240606-100747-arnaudb.json [10:10:57] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6 [10:10:57] : Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:10:59] PROBLEM - BGP status 
on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [10:10:59] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:11:09] (PS1) Giuseppe Lavagetto: Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - https://gerrit.wikimedia.org/r/1039642 [10:11:51] !log pfischer@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:12:57] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:59] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:14:23] (PS1) Santiago Faci: MPIC chart: Added missing property:value needed to interact with IDP [deployment-charts] - https://gerrit.wikimedia.org/r/1039643 (https://phabricator.wikimedia.org/T362642) [10:14:54] (PS2) Santiago Faci: MPIC chart: Added missing property:value needed to interact with IDP [deployment-charts] - https://gerrit.wikimedia.org/r/1039643 (https://phabricator.wikimedia.org/T362642) [10:25:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active -
kubernetes-codfw, AS6 [10:25:03] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:25:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS6 [10:25:07] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:23] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [10:26:52] FIRING: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [10:27:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:07] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:27:19] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [10:31:52] RESOLVED: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [10:32:17] jouncebot: now [10:32:17] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1000) [10:32:24] jouncebot: next [10:32:24] In 1 hour(s) and 27 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1200) [10:35:37] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [10:35:42] (PS20) Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan) [10:36:07] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [10:36:07] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:36:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS6 [10:36:15] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:36:21] (PS5) Clément Goubert: docker_registry_ha: Puppetize nginx config [puppet] - https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) [10:37:26] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [10:38:07]
RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:38:15] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:38:30] !log sfaci@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [10:40:05] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [10:40:38] !log sfaci@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [10:40:49] (PS5) EoghanGaffney: lists: Add option to block incoming mail [puppet] - https://gerrit.wikimedia.org/r/1038772 [10:41:24] !log sfaci@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [10:41:42] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [10:42:20] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2784/co" [puppet] - https://gerrit.wikimedia.org/r/1038772 (owner: EoghanGaffney) [10:43:46] !log sfaci@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [10:44:15] (PS1) Ilias Sarantopoulos: ml-services: make staging ores-legacy use liftwing staging [deployment-charts] - https://gerrit.wikimedia.org/r/1039651 (https://phabricator.wikimedia.org/T363336) [10:44:22] FIRING: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [10:45:35] !log sfaci@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [10:45:46] !log sfaci@deploy1002 helmfile [eqiad] START
helmfile.d/services/geo-analytics: apply [10:45:54] (03PS2) 10Ilias Sarantopoulos: ml-services: make staging ores-legacy use liftwing staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039651 (https://phabricator.wikimedia.org/T363336) [10:46:23] (03CR) 10Volans: Fix lots of CI errors (037 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi) [10:46:44] (03PS2) 10Giuseppe Lavagetto: Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 [10:47:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_m [10:47:11] %23BGP_status [10:47:21] !log sfaci@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply [10:47:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [10:47:23] g%23BGP_status [10:47:54] (03CR) 10Volans: "reply inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [10:49:22] RESOLVED: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - 
https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing [10:50:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:50:23] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:54:28] (03CR) 10Volans: base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [10:54:46] FIRING: [2x] Storage /var over 50%: Alert for device ssw1-e1-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [10:55:45] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:56:27] (03PS1) 10Ebrahim: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039657 (https://phabricator.wikimedia.org/T362494) [10:58:45] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [10:58:46] (03CR) 10Ebrahim: "Just rebased the patch to make that "Merge Conflict" banner on the original patch, if the original rebases to master, as requested there, " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039657 (https://phabricator.wikimedia.org/T362494) (owner: 10Ebrahim) [11:01:12] (03PS3) 10Anzx: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) [11:01:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - 
kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:01:15] g%23BGP_status [11:01:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:01:27] g%23BGP_status [11:02:19] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [11:03:15] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:27] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:07] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1039643 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:05:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [11:08:57] (03CR) 10EoghanGaffney: [V:03+1] lists: Add option to block incoming mail (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [11:09:54] !log klausman@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-eqiad [11:10:47] FIRING: HelmReleaseBadStatus: Helm release data-gateway/main on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=data-gateway - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:12:21] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:12:21] g%23BGP_status [11:12:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_ [11:12:31] g%23BGP_status [11:14:21] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:31] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:57] (03CR) 10Milimetric: "yes, as I understand it the data can stay but the links are going away" [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 10Milimetric) [11:15:06] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [11:15:25] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9867031 (10hnowlan) >>! In T364921#9866023, @Scott_French wrote: > The last log line is: > > ` > {"@timestamp":"2024-06-05T22:49:28Z","message":"Conne... [11:15:45] (03CR) 10Klausman: [C:03+2] base functions: make sleep() output a bit friendlier (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [11:15:46] (03CR) 10Santiago Faci: [C:03+2] MPIC chart: Added missing property:value needed to interact with IDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039643 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:16:41] (03Merged) 10jenkins-bot: MPIC chart: Added missing property:value needed to interact with IDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039643 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:20:47] RESOLVED: HelmReleaseBadStatus: Helm release data-gateway/main on k8s@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=data-gateway - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:22:21] (03CR) 10Btullis: [C:03+2] dumps/other: remove unused links [puppet] - 10https://gerrit.wikimedia.org/r/1039246 (owner: 
10Milimetric) [11:22:39] PROBLEM - BGP status on lsw1-b5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:23:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:23:35] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:23] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:24] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [11:25:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:25:39] RECOVERY - BGP status on lsw1-b5-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:27:56] !log kicking off k8s eqiad restarts - T366555 [11:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:28] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-codfw [11:32:34] (03Abandoned) 10Ebrahim: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039657 
(https://phabricator.wikimedia.org/T362494) (owner: 10Ebrahim) [11:33:45] (03CR) 10Ebrahim: "Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [11:37:38] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867080 (10cmooney) Detailed steps are in P64182 [11:38:38] jouncebot: nowandnext [11:38:38] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [11:38:38] In 0 hour(s) and 21 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1200) [11:39:09] effie: shall I deploy mw stuff? or your k8s restarts gonna not be happy [11:39:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [11:40:02] Amir1: I would rather you wait for this batch to finish if that is alright [11:40:18] sure, I have nothing urgent. Let me know once done [11:40:22] it is not the full cluster rather a subset [11:40:59] (03CR) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [11:41:00] Amir1: I will break for lunch, but look for - Cookbook sre.k8s.reboot-nodes finish log line if you dont hear from me [11:41:34] sure. 
thanks [11:41:37] (03CR) 10Ebrahim: [C:03+1] commonswiki: Enable numeric wgCategoryCollation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [11:42:19] (03Abandoned) 10DCausse: cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1038705 (owner: 10DCausse) [11:43:55] (03PS3) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) [11:45:04] (03CR) 10CI reject: [V:04-1] kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) (owner: 10Hnowlan) [11:46:41] (03PS4) 10Hnowlan: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) [11:48:54] (03CR) 10Hnowlan: kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) (owner: 10Hnowlan) [11:50:42] (03PS8) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [11:50:42] (03PS20) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [11:51:01] (03PS4) 10Anzx: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) [11:51:35] (03CR) 10Anzx: commonswiki: Enable numeric wgCategoryCollation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [11:54:58] (03PS1) 10Majavah: aptrepo: Add mirror for OpenTofu packages [puppet] - 10https://gerrit.wikimedia.org/r/1039677 (https://phabricator.wikimedia.org/T365696) [11:55:38] !log cmooney@cumin1002 START - Cookbook 
sre.hosts.downtime for 0:40:00 on lvs1017.eqiad.wmnet with reason: moving lvs1017 link to row E from spine to leaf [11:55:51] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs1017.eqiad.wmnet with reason: moving lvs1017 link to row E from spine to leaf [11:55:56] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867102 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=54328f3a-52e5-42cd-bdf1-26ee5617a4d5) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their... [11:56:34] !log disabling PyBal on lvs1017 to allow for cable move T366361 [11:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:37] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1200) [12:01:10] (03PS9) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [12:01:10] (03PS21) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [12:02:37] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration chart: Added missing config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039682 (https://phabricator.wikimedia.org/T362642) [12:07:31] (03CR) 10Ebrahim: [C:03+1] commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [12:08:29] (03PS10) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [12:08:29] (03PS22) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [12:08:40] 
FIRING: SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:19] (03CR) 10Brouberol: [C:03+1] Metrics Platform Instrument Configuration chart: Added missing config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039682 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [12:11:00] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration chart: Added missing config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039682 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [12:11:56] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration chart: Added missing config property [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039682 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [12:13:21] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:13:25] (03PS11) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [12:13:25] (03PS23) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [12:13:41] (03CR) 10Ayounsi: Fix lots of CI errors (0310 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi) [12:14:27] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [12:14:41] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on 18 hosts with reason: upgrading spine switches eqiad rows 
e and f [12:14:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 18 hosts with reason: upgrading spine switches eqiad rows e and f [12:15:09] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867139 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=512f5f90-4832-4c61-b0eb-75b61fcd6f8c) set by cmooney@cumin1002 for 1:30:00 on 18 host(s) and thei... [12:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:17:21] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:18:06] (03CR) 10DCausse: [C:03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [12:18:20] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039693 (owner: 10L10n-bot) [12:18:40] (03PS1) 10Vgutierrez: depool text@codfw before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039694 (https://phabricator.wikimedia.org/T366466) [12:18:45] (03PS12) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [12:18:45] (03PS24) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [12:20:19] (03PS4) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [12:20:20] (03PS5) 10Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [12:20:20] (03PS1) 10Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [puppet] - 10https://gerrit.wikimedia.org/r/1039697 [12:21:18] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [12:21:49] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [12:21:51] (03PS1) 10Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039700 [12:22:38] (03CR) 10Ayounsi: Fix lots of CI errors (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi) [12:22:47] (03CR) 10CI reject: [V:04-1] Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039700 (owner: 10Ayounsi) [12:23:39] I'll +2 my upcoming backport patch in advance (20-25 min) to save the time for the next window. 
[12:24:55] !log cmooney@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs1017.eqiad.wmnet [12:24:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1017.eqiad.wmnet [12:25:13] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lvs1018.eqiad.wmnet with reason: moving lvs1018 link to row E from spine to leaf [12:25:25] (03CR) 10JMeybohm: docker_registry_ha: Puppetize nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [12:25:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lvs1018.eqiad.wmnet with reason: moving lvs1018 link to row E from spine to leaf [12:25:39] !log disabling PyBal on lvs1018 to allow for cable move T366361 [12:25:39] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867154 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=76763bfc-4091-4d8a-b3f8-e84d96a9bd49) set by cmooney@cumin1002 for 0:40:00 on 1 host(s) and their... 
[12:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:42] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [12:26:17] (03PS2) 10Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039700 [12:27:45] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic1@codfw for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039706 (https://phabricator.wikimedia.org/T366466) [12:27:47] (03PS1) 10Vgutierrez: hiera: Enable IPIP on text@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1039707 (https://phabricator.wikimedia.org/T366466) [12:28:56] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4051.ulsfo.wmnet [12:29:34] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4051.ulsfo.wmnet [12:29:40] (03PS1) 10Ayounsi: black and isort all the files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039708 [12:30:01] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1039706 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:30:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:30:24] (03CR) 10Fabfur: [C:03+1] depool text@codfw before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039694 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:30:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:30:42] (03CR) 10CI reject: [V:04-1] black and isort all the files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039708 
(owner: 10Ayounsi) [12:30:50] (03CR) 10Fabfur: [C:03+1] hiera: Enable IPIP on high-traffic1@codfw for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039706 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:31:03] (03CR) 10Fabfur: [C:03+1] hiera: Enable IPIP on text@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1039707 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:31:38] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2786/co" [puppet] - 10https://gerrit.wikimedia.org/r/1039707 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:32:17] (03CR) 10JMeybohm: [V:03+2 C:03+2] "Ah, damn. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039626 (https://phabricator.wikimedia.org/T362310) (owner: 10Hnowlan) [12:33:03] (03Merged) 10jenkins-bot: mesh: publish mesh.configuration 1.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039626 (https://phabricator.wikimedia.org/T362310) (owner: 10Hnowlan) [12:33:18] (03CR) 10JMeybohm: kask: add mesh configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T36399) (owner: 10Hnowlan) [12:33:41] (03CR) 10Vgutierrez: [C:03+2] depool text@codfw before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039694 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:33:47] !log depool text@codfw before enabling IPIP encapsulation - T366466 [12:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:51] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:33:52] !log disabling BGP to ssw1-e1-eqiad from cr1-eqiad in advance of upgrade T366361 [12:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:56] T366361: 
Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [12:34:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:34:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:35:57] (03PS2) 10Ayounsi: black and isort all the files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039708 [12:36:13] (03CR) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:36:19] (03CR) 10Effie Mouzeli: [C:03+2] [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [12:36:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw1363 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:37:59] (03CR) 10KartikMistry: [C:03+2] CX: Fix translation container max width for large screens [extensions/ContentTranslation] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1039571 (https://phabricator.wikimedia.org/T366374) (owner: 10KartikMistry) [12:38:13] ^^ advance +2 for the next backport window. 
[12:39:04] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4051.ulsfo.wmnet [12:39:43] !log rebooting ssw1-e1-eqiad to upgrade JunOS [12:39:44] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic1@codfw for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039706 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:39:45] (03CR) 10JMeybohm: [C:03+1] miscweb: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [12:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:14] (03CR) 10Anzx: [C:04-1] "wgMetaNamespace in core-Namespaces.php also needs change as per task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [12:40:16] (03CR) 10JMeybohm: [C:03+1] trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [12:40:41] (03CR) 10Anzx: [mswiktionary] Change the default Sitename value to Wikikamus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [12:40:48] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4051.ulsfo.wmnet [12:42:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9867210 (10MatthewVernon) @Eevans are you OK to do this, please? Should just be a case of checking `swift-dispersion-report` and... 
[12:42:25] (03CR) 10JMeybohm: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [12:43:31] (03PS1) 10Jcrespo: dbbackups: Drain dbprov1001, dbprov1002, dbprov2001 & dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/1039713 (https://phabricator.wikimedia.org/T362509) [12:43:42] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9867212 (10MatthewVernon) @Eevans would you be OK to handle this as well, please? It's a bit more involved as you'll need to run... [12:44:48] (03PS2) 10Jcrespo: dbbackups: Drain dbprov1001, dbprov1002, dbprov2001 & dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/1039713 (https://phabricator.wikimedia.org/T362509) [12:44:55] !log disabling PyBal on lvs1019 to allow for cable move T366361 [12:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:58] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [12:45:15] (03CR) 10JMeybohm: [C:04-1] sextant cache: Allow defining mcrouter's clusterIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris) [12:45:20] (03CR) 10JMeybohm: [C:03+1] sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857 (owner: 10Alexandros Kosiaris) [12:46:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 223, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:46:06] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on text@codfw [puppet] - 10https://gerrit.wikimedia.org/r/1039707 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:47:25] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL 
CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:47:41] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9867218 (10MatthewVernon) @Eevans you OK to handle this, please? Should just be a quick cluster health check afterwards. [12:47:47] (03CR) 10JMeybohm: [C:03+2] sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:47:50] (03CR) 10JMeybohm: [C:03+2] sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:47:55] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:48:37] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [12:49:19] (03CR) 10AikoChou: [C:03+1] ml-services: make staging ores-legacy use liftwing staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039651 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [12:49:55] (03CR) 10JMeybohm: [C:03+1] mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (owner: 10Alexandros Kosiaris) [12:50:12] !log rolling restart of pybal on lvs2014 and lvs2011 - T366466 [12:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:15] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:51:12] (03PS9) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [12:51:12] (03PS5) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [12:51:12] (03PS5) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [12:51:26] (03CR) 10Giuseppe Lavagetto: Add new chart statsd-exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [12:51:35] (03Merged) 10jenkins-bot: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:52:12] (03Merged) 10jenkins-bot: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:53:49] (03CR) 10Ayounsi: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [12:53:51] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: make staging ores-legacy use liftwing staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039651 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [12:54:43] (03Merged) 10jenkins-bot: ml-services: make staging ores-legacy use liftwing staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039651 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [12:54:55] RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:55:04] (03Merged) 
10jenkins-bot: CX: Fix translation container max width for large screens [extensions/ContentTranslation] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1039571 (https://phabricator.wikimedia.org/T366374) (owner: 10KartikMistry) [12:55:25] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:55:32] (03PS1) 10Vgutierrez: Revert "depool text@codfw before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1039581 (https://phabricator.wikimedia.org/T366466) [12:55:50] (03CR) 10JMeybohm: [C:03+1] chromium-render: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:56:01] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:56:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:56:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [12:57:07] (03CR) 10Vgutierrez: [C:03+2] Revert "depool text@codfw before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1039581 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:57:15] !log repool text@cofw with IPIP encapsulation enabled - T366466 [12:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:18] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:57:21] even codfw :) [12:57:40] bblack, claime, Emperor, XioNoX, topranks ^^ [12:57:42] (03CR) 10JMeybohm: "AIUI this is going to go away, but I'm not sure if we're also going to remove the chart (I hope we do)... 
@hnowlan@wikimedia.org do you ha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:58:06] (03CR) 10JMeybohm: "unresolve" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:58:39] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 83 connections established with conf1007.eqiad.wmnet:4001 (min=83) https://wikitech.wikimedia.org/wiki/PyBal [12:58:40] (03CR) 10JMeybohm: "So many incomplete comments. Here is the context: https://phabricator.wikimedia.org/T345274" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:00:02] Amir1: you may deploy if not already [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1300). [13:00:04] \o/ [13:00:05] cmelo, kart_, Gerges, anzx, and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:15] o/ [13:00:22] Hi [13:00:24] o) [13:00:34] hello [13:00:38] * kart_ is here [13:01:04] mine can go last [13:01:12] cmelo: You can start. 
[13:01:14] (03CR) 10JMeybohm: [C:03+1] eventstreams: add securityContext to all production containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037861 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:01:16] I'm still waiting on verification from SRE of a corresponding puppet deployment [13:02:10] ok, thanks [13:02:38] o/ [13:03:00] (03CR) 10JMeybohm: [C:03+1] shellbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:04:06] Any deployer(s) around? :) [13:04:10] * TheresNoTime can deploy [13:04:31] TheresNoTime: Please start with cmelo's patch, then I can deploy my patch (already merged) [13:04:42] ack, starting with cmelo's patch [13:04:55] thanks [13:05:14] (03CR) 10JMeybohm: [C:03+1] kask: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037195 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:05:31] urbanecm: `backport is locked by urbanecm (pid 15066) on Thu Jun 6 07:58:42 2024; reason is "Backport for [[gerrit:1039287|Add throttle exception for an upcoming workshop (T366748)]]".` [13:05:32] T366748: Request for a throttle exemption: workshop in Brno - https://phabricator.wikimedia.org/T366748 [13:05:34] I removed mine from the calendar, and will sync it on Monday [13:05:40] effie: thanks. 
I wait for the deployment window to finish first [13:06:37] RECOVERY - Check whether ferm is active by checking the default input chain on mw1363 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:06:49] (03CR) 10JMeybohm: [C:03+1] termbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037193 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:07:05] Amir1: cool [13:07:10] jouncebot: next [13:07:10] In 1 hour(s) and 52 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1500) [13:07:39] (03CR) 10JMeybohm: [C:03+1] wikifeeds: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037164 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:07:56] fyi, I can't deploy at the moment as there's still a lock in place from a running scap [13:07:58] TheresNoTime: hmm, that's from several hours ago.. 
let's see if Martin is still around, otherwise we might have to manually stop that deployment + unlock it [13:08:26] !log klausman@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-eqiad [13:09:16] taavi: ack yeah [13:09:32] urbanecm: one last ping, you seem to have a forgotten deploy in process [13:10:31] so that means https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/a8344d7e7cce6ecd390e5c8ca5ceea59259fbe8e%5E%21/#F0 is in practice undeployed, it seems relatively safe to just sync out but we can also revert it [13:11:43] !log taavi@deploy1002 ~ $ sudo kill 32174 # kill forgotten scap sync-world process [13:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:02] taavi: thank you [13:12:03] TheresNoTime: in theory you should be good to deploy now, assuming you're fine syncing that patch out too [13:12:09] yeaah :) [13:12:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) (owner: 10Mhorsey) [13:12:37] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 16, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:10] (03Merged) 10jenkins-bot: Activate campaignEvents extension on Igbo wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) (owner: 10Mhorsey) [13:15:10] jouncebot: nowandnext [13:15:10] For the next 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1300) [13:15:10] In 1 hour(s) and 46 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1500) [13:15:10] (03CR) 10Ladsgroup: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1039713 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [13:15:10] (03PS9) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 [13:15:19] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1038862|Activate campaignEvents extension on Igbo wiki. (T363199)]] [13:15:22] T363199: Add configs on mediawiki-config to enable CampaignEvents on igbo wikipedia (igbowiki) - https://phabricator.wikimedia.org/T363199 [13:15:37] (03CR) 10Jelto: "one naming comment in-line, otherwise it looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [13:15:52] (03CR) 10JMeybohm: [C:03+1] "This should not interfere with the selected apparmor profile" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:15:58] (03CR) 10JMeybohm: [C:03+1] function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [13:16:16] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867277 (10cmooney) The first phase of this is complete, ssw1-e1-eqiad has been upgraded. I am going to pause before completing ssw1-f1-eqiad as some of the output is stran... 
[13:16:18] (03PS6) 10Clément Goubert: docker_registry_ha: Puppetize nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) [13:16:56] !log samtar@deploy1002 mhorsey and samtar: Backport for [[gerrit:1038862|Activate campaignEvents extension on Igbo wiki. (T363199)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:07] (03CR) 10Clément Goubert: docker_registry_ha: Puppetize nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [13:17:16] cmelo: patch live on mwdebug, can you test? [13:18:07] (03CR) 10Jcrespo: [C:03+2] dbbackups: Drain dbprov1001, dbprov1002, dbprov2001 & dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/1039713 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [13:18:08] TheresNoTime: Once you're done, can you ping? Have a minor backport to do. [13:18:20] James_F: will do [13:18:25] Thanks! [13:19:16] Yes, tested, it and sounds good!!! thanks [13:19:49] !log samtar@deploy1002 mhorsey and samtar: Continuing with sync [13:20:30] tysm [13:21:24] Thanks TheresNoTime! [13:21:49] (03PS2) 10GergesShamon: [mswiktionary] Change the default Sitename value to Wikikamus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) [13:22:44] kart_: just realised that I've sync'd your change [13:23:27] how? [13:23:37] (03PS5) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [13:24:57] TheresNoTime: in that case, patch works as expected. 
You can deploy it :) [13:25:11] kart_: was already merged, and when I sync'd 1038862 I skipped over the fact the repo was dirty because I assumed it was T366748 [13:25:12] T366748: Request for a throttle exemption: workshop in Brno - https://phabricator.wikimedia.org/T366748 [13:25:25] (03PS6) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [13:26:27] (03CR) 10Cwhite: [C:03+2] logstash: limit LogstashKafkaConsumerLag to Logstash-specific consumer groups [alerts] - 10https://gerrit.wikimedia.org/r/1037487 (https://phabricator.wikimedia.org/T366227) (owner: 10Cwhite) [13:26:55] TheresNoTime: no issue, just go ahead and deploy both. [13:27:01] ack, phew :) [13:27:23] (03CR) 10Cwhite: [C:03+2] o11y: add BenthosKafkaConsumerLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1037488 (https://phabricator.wikimedia.org/T366227) (owner: 10Cwhite) [13:27:39] (03Merged) 10jenkins-bot: logstash: limit LogstashKafkaConsumerLag to Logstash-specific consumer groups [alerts] - 10https://gerrit.wikimedia.org/r/1037487 (https://phabricator.wikimedia.org/T366227) (owner: 10Cwhite) [13:28:25] Gerges: fyi your patch next [13:28:30] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1038862|Activate campaignEvents extension on Igbo wiki. 
(T363199)]] (duration: 14m 07s) [13:28:33] T363199: Add configs on mediawiki-config to enable CampaignEvents on igbo wikipedia (igbowiki) - https://phabricator.wikimedia.org/T363199 [13:28:36] Ok [13:28:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [13:29:10] (03Merged) 10jenkins-bot: o11y: add BenthosKafkaConsumerLag alert [alerts] - 10https://gerrit.wikimedia.org/r/1037488 (https://phabricator.wikimedia.org/T366227) (owner: 10Cwhite) [13:29:21] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797 (10Clement_Goubert) 03NEW [13:29:31] (03Merged) 10jenkins-bot: [mswiktionary] Change the default Sitename value to Wikikamus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039612 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [13:30:02] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1039612|[mswiktionary] Change the default Sitename value to Wikikamus (T366549)]] [13:30:07] T366549: Rename namespace "Wiktionary" to "Wikikamus" in ms.wiktionary.org - https://phabricator.wikimedia.org/T366549 [13:31:45] TheresNoTime: Let me know when scap is done. I can deploy my patch. [13:32:32] !log samtar@deploy1002 samtar and gergesshamon: Backport for [[gerrit:1039612|[mswiktionary] Change the default Sitename value to Wikikamus (T366549)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:32:36] Gerges: on mwdebug for testing [13:33:53] kart_: oh, sure.. did I *not* sync & deploy it during 1038862 then? 
[13:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:36:07] Hi TheresNoTime, Only the site name has changed, and the project namespace name has not changed [13:36:18] Gerges: wgMetaNamespace in core-Namespaces.php also needs change as per task [13:36:44] TheresNoTime: sync was done, but we need to deploy it. [13:37:04] Gerges: per anzx, will sync this but you'll need to do a follow up patch it seems [13:37:06] !log samtar@deploy1002 samtar and gergesshamon: Continuing with sync [13:37:25] anzx: You are supposed to get the value of wgMetaNamespace from wgSitename [13:37:27] kart_: ah, my bad sorry (: will ping you once this sync is done and let you do that [13:37:37] No worries! [13:38:20] (03CR) 10Alexandros Kosiaris: [C:04-1] "Nit in the commit message, otherwise LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [13:38:22] Anyway I will set the value of wgMetaNamespace in the patch [13:38:54] (03PS7) 10Clément Goubert: docker_registry_ha: Puppetize nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) [13:39:36] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry_ha: Puppetize nginx config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [13:39:50] (03CR) 10Clément Goubert: [C:03+2] docker_registry_ha: Puppetize nginx config [puppet] - 10https://gerrit.wikimedia.org/r/1039641 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [13:44:10] (03PS1) 10Clément Goubert: docker_registry_ha: Fix nginx conf file path [puppet] - 10https://gerrit.wikimedia.org/r/1039723 [13:44:19] I can't update the patch there is a conflict 
[13:44:37] (03CR) 10Clément Goubert: [C:03+2] docker_registry_ha: Fix nginx conf file path [puppet] - 10https://gerrit.wikimedia.org/r/1039723 (owner: 10Clément Goubert) [13:44:42] Gerges: you will need to do a follow up, a separate new patch to make that change [13:44:45] o/ [13:44:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host wikikube-ctrl1001.eqiad.wmnet [13:44:50] Ok [13:44:54] !log kamila@cumin1002 START - Cookbook sre.hosts.dhcp for host wikikube-ctrl1001.eqiad.wmnet [13:44:59] TheresNoTime: hi, are you the deployer [13:45:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host wikikube-ctrl1001.eqiad.wmnet [13:45:14] Nemoralis: yes [13:45:37] Would you mind backporting this patch too? I forgot to schedule it [13:45:38] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1037505 [13:46:08] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1039612|[mswiktionary] Change the default Sitename value to Wikikamus (T366549)]] (duration: 16m 05s) [13:46:10] Nemoralis: should be fine, but will be a little while yet until I can [13:46:11] T366549: Rename namespace "Wiktionary" to "Wikikamus" in ms.wiktionary.org - https://phabricator.wikimedia.org/T366549 [13:46:31] kart_: please go ahead, and then I'll do anzx & Nemoralis's patches afterwards [13:46:49] Thanks [13:47:00] Nemoralis: can you add it to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1300 please? 
[13:47:45] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1039571|CX: Fix translation container max width for large screens (T366374)]] [13:47:48] T366374: Tools section showing in the bottom in Content Translation tool - https://phabricator.wikimedia.org/T366374 [13:47:54] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [13:48:00] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9867423 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [13:48:14] (03PS10) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [13:48:14] (03PS6) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [13:48:14] (03PS6) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [13:48:14] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow passing variables to php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039724 [13:48:15] (03CR) 10Clément Goubert: "Created https://phabricator.wikimedia.org/T366797 for this" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [13:48:19] (03PS1) 10Santiago Faci: MPIC chart: Fixed a typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039725 (https://phabricator.wikimedia.org/T362642) [13:49:07] (03CR) 10CI reject: [V:04-1] mediawiki: allow passing variables to php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039724 (owner: 10Giuseppe Lavagetto) [13:49:33] TheresNoTime: done 
[13:49:39] thanks :) [13:50:15] !log kartik@deploy1002 kartik: Backport for [[gerrit:1039571|CX: Fix translation container max width for large screens (T366374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:46] TheresNoTime: I will publish the patch and schedule it in the UTC late backport window. [13:50:58] Gerges: sounds good, thank you :) [13:51:29] (03PS5) 10Anzx: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) [13:52:11] (03PS2) 10NMW03: Add project namespace alias for Azerbaijani Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037505 (https://phabricator.wikimedia.org/T365966) [13:52:13] !log kartik@deploy1002 kartik: Continuing with sync [13:52:23] (03PS1) 10JMeybohm: mesh: Update mesh.name dependency to mesh.configuration:1.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039726 (https://phabricator.wikimedia.org/T362310) [13:52:25] (03PS1) 10JMeybohm: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) [13:53:03] PROBLEM - Check whether ferm is active by checking the default input chain on mw1421 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:53:09] (03CR) 10CI reject: [V:04-1] flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:53:18] jouncebot: nowandnext [13:53:18] For the next 0 hour(s) and 6 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1300) [13:53:18] In 1 hour(s) and 6 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1500) [13:53:18] (03CR) 10CI reject: 
[V:04-1] mesh: Update mesh.name dependency to mesh.configuration:1.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039726 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [13:55:21] Hard to deploy more than 3 patches in a single window :) [13:55:52] (03PS1) 10GergesShamon: [mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039729 (https://phabricator.wikimedia.org/T366549) [13:56:18] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4050.ulsfo.wmnet [13:56:35] (03PS2) 10JMeybohm: mesh: Update mesh.name dependency to mesh.configuration:1.8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039726 (https://phabricator.wikimedia.org/T362310) [13:56:36] (03PS2) 10JMeybohm: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) [13:56:57] Hi TheresNoTime [13:56:59] (03CR) 10Sergio Gimeno: "question: should we split each flag in a separate change to run the script in between?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [13:57:08] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4050.ulsfo.wmnet [13:57:28] Gerges: hi [13:57:34] TheresNoTime: new patch https://gerrit.wikimedia.org/r/1039729 :) [13:57:51] Is there time left? [13:58:01] (03CR) 10Sergio Gimeno: "Not resolved" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [13:58:12] taavi: TheresNoTime: so sorry about that! I was convinced I finished the deploy, but [13:58:21] Gerges: not really, there's two more patches left — I might be able to do it in a bit though, but please schedule it for the next deployment window [13:58:22] ...apparently that was not the case. [13:58:36] urbanecm: np! :D [13:58:55] Also, hello! 
[13:59:00] o/ [13:59:15] o/ [13:59:18] TheresNoTime: OK, no problem [13:59:24] you free urbanecm? I have a meeting in a moment and there's still two patches to deploy.. [13:59:36] urbanecm: no worries, it happens :-) [13:59:44] (CR) JMeybohm: [C:+2] mesh: Update mesh.name dependency to mesh.configuration:1.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1039726 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm) [14:00:40] (Merged) jenkins-bot: mesh: Update mesh.name dependency to mesh.configuration:1.8 [deployment-charts] - https://gerrit.wikimedia.org/r/1039726 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm) [14:00:56] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1039571|CX: Fix translation container max width for large screens (T366374)]] (duration: 13m 11s) [14:01:00] T366374: Tools section showing in the bottom in Content Translation tool - https://phabricator.wikimedia.org/T366374 [14:01:18] TheresNoTime: Not really unfortunately, I just skimmed at my IRC backlog at a phone [14:01:24] urbanecm: ack :) [14:01:35] kart_: your deployment done? [14:01:57] TheresNoTime: yes. Just now.
[14:02:30] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: host reimage [14:02:31] anzx and Nemoralis: doing yours now together [14:02:40] TheresNoTime: ok [14:02:45] ok [14:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [14:02:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037505 (https://phabricator.wikimedia.org/T365966) (owner: 10NMW03) [14:03:27] (03CR) 10Michael Große: "For a production wiki I think it is a good idea to do this in multiple steps, but for testwiki, I think it is ok to run the script after b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [14:03:31] (03PS3) 10JMeybohm: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) [14:03:36] (03Merged) 10jenkins-bot: commonswiki: Enable numeric wgCategoryCollation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [14:03:40] (03Merged) 10jenkins-bot: Add project namespace alias for Azerbaijani Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037505 (https://phabricator.wikimedia.org/T365966) (owner: 10NMW03) [14:04:15] !log samtar@deploy1002 Started scap: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]] [14:04:19] T362494: Enable numerical category sorting on Commons - https://phabricator.wikimedia.org/T362494 [14:04:20] T365966: Add "VM" namespace alias to Azerbaijani Wikisource - 
https://phabricator.wikimedia.org/T365966 [14:05:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: host reimage [14:06:01] (03CR) 10CDobbins: [V:03+1 C:03+2] purged: set use_pki to true for all eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [14:06:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4050.ulsfo.wmnet [14:06:43] !log samtar@deploy1002 samtar and anzx and nmw03: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:46] anzx and Nemoralis: patches are on mwdebug, can you both test your respective bits? [14:07:45] LGTM [14:07:49] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4050.ulsfo.wmnet [14:07:54] Nemoralis: ack [14:08:19] will sync once I hear from anzx [14:08:22] I think category collation will work after running the maintenance script, right? [14:08:27] TheresNoTime: LGTM, mine need https://www.mediawiki.org/wiki/Manual:UpdateCollation.php to view change [14:08:43] yeah, we tested this in Bosnian Wikiquote with Lucas [14:08:45] anzx: ack, syncing & will run after [14:08:48] I schedule a change via this tool https://schedule-deployment.toolforge.org/, But no change was added to https://wikitech.wikimedia.org/wiki/Deployments [14:08:48] !log samtar@deploy1002 samtar and anzx and nmw03: Continuing with sync [14:08:56] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1034941 [14:09:09] Gerges: same. I had to add mine manually [14:09:47] Ok [14:11:26] (03PS1) 10Ayounsi: rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 [14:11:34] Gerges: hm, odd.. 
^ fyi for bd808 I guess [14:11:41] (03PS6) 10EoghanGaffney: lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772 [14:13:00] (03CR) 10EoghanGaffney: lists: Add option to block incoming mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:13:43] (03CR) 10Ladsgroup: [C:03+1] Upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:14:12] !log sudo cumin 'A:cp and A:eqsin' 'disable-puppet "merging CR 1038881"' [14:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:29] !log disabling BGP on cr2-eqiad towards ssw1-f1-eqiad prior to upgrade of ssw later T366361 [14:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:32] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [14:15:27] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt with reason: upgrading spine switches eqiad rows e and f [14:15:42] (03CR) 10CI reject: [V:04-1] rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 (owner: 10Ayounsi) [14:15:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ssw1-f1-eqiad,ssw1-f1-eqiad IPv6,ssw1-f1-eqiad.mgmt with reason: upgrading spine switches eqiad rows e and f [14:15:52] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867516 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=2e3e9f53-54b4-4b8d-b9d6-ab280392b41c) set by cmooney@cumin1002 for 2:00:00 on 3 host(s) and their... 
[14:15:55] (03CR) 10Ladsgroup: [C:03+1] "Thank you so much <3" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:16:28] T366794 ig [14:16:29] T366794: Did not update deployment calendar - https://phabricator.wikimedia.org/T366794 [14:16:53] (03PS2) 10Ayounsi: rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 [14:17:13] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:1037006|commonswiki: Enable numeric wgCategoryCollation (T362494)]], [[gerrit:1037505|Add project namespace alias for Azerbaijani Wikisource (T365966)]] (duration: 12m 58s) [14:17:17] T362494: Enable numerical category sorting on Commons - https://phabricator.wikimedia.org/T362494 [14:17:18] T365966: Add "VM" namespace alias to Azerbaijani Wikisource - https://phabricator.wikimedia.org/T365966 [14:17:34] !log hashar@deploy1002 Started deploy [integration/docroot@eee90e6]: Build dependencies updates [14:17:44] !log hashar@deploy1002 Finished deploy [integration/docroot@eee90e6]: Build dependencies updates (duration: 00m 09s) [14:18:09] !log hashar@deploy1002 Started deploy [integration/docroot@eee90e6]: Build dependencies updates [14:18:16] thanks TheresNoTime [14:18:20] !log hashar@deploy1002 Finished deploy [integration/docroot@eee90e6]: Build dependencies updates (duration: 00m 10s) [14:18:25] bah [14:18:31] anzx: `UpdateCollation.php` for commonswiki is going to take a while afaik... I'm not sure I want to run that outside of a deployment window? 
[14:19:05] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:20:03] TheresNoTime: should I schedule maintenance script run for next window [14:20:03] anzx: I'll finish up my meeting and look at it again [14:20:16] Ok [14:20:31] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2787/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:20:47] (03CR) 10CI reject: [V:04-1] rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 (owner: 10Ayounsi) [14:23:04] RECOVERY - Check whether ferm is active by checking the default input chain on mw1421 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:24:14] (03PS7) 10EoghanGaffney: lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772 [14:24:44] anzx: https://phabricator.wikimedia.org/T362494#9867564 [14:25:02] !log close UTC afternoon backport window [14:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:18] Woah, update collation for Commons? Yeah, that'll take weeks, won't it? [14:25:40] :D [14:25:44] (03CR) 10Hnowlan: [C:03+2] Upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:26:02] James_F: oh crap, sorry — forgot to ping you [14:26:07] (03PS8) 10EoghanGaffney: lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772 [14:26:08] TheresNoTime: No worries; good to go now? [14:26:15] yup, I'm all done [14:26:21] Cool. 
[14:26:23] 10ops-codfw, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T366724#9867575 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm reseated ps1 power cord at both ends. alert cleared. [14:26:29] (03CR) 10Jforrester: [C:03+2] Add wikilambda-edit-monolingual-text-placeholder message to extension.json [extensions/WikiLambda] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038828 (https://phabricator.wikimedia.org/T359782) (owner: 10Jforrester) [14:26:50] (03PS3) 10Ayounsi: rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 [14:27:04] (03CR) 10Sergio Gimeno: [C:03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [14:27:20] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2789/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:27:59] TheresNoTime: Thank you [14:29:48] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:30:54] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9867585 (10elukey) [14:31:01] (03CR) 10Hnowlan: "Yes, we are! I have a patch chain open to remove the service, just need to get around to rolling these out. 
So we can skip this change for" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:32:00] (03Merged) 10jenkins-bot: Add wikilambda-edit-monolingual-text-placeholder message to extension.json [extensions/WikiLambda] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038828 (https://phabricator.wikimedia.org/T359782) (owner: 10Jforrester) [14:33:14] (03CR) 10JHathaway: [C:03+2] "oh thanks, that's helpful, I didn't realize there was a separate configuration nob. I'll create a patch there as well." [puppet] - 10https://gerrit.wikimedia.org/r/1039245 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:35:51] (03Merged) 10jenkins-bot: Upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:38:05] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772 (owner: 10EoghanGaffney) [14:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:34] (03PS1) 10EoghanGaffney: lists: Remove `server_uses_stunnel` option [puppet] - 10https://gerrit.wikimedia.org/r/1039735 [14:39:38] (03PS1) 10Peter Fischer: Search update pipeline: enable saneitizer explicitly for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039736 [14:40:50] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2790/co" [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [14:40:59] (03PS1) 10JHathaway: mw-debug: change mail_host 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) [14:41:30] (03CR) 10JHathaway: "kindly review" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:41:31] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 (owner: 10Ayounsi) [14:43:45] !log sudo cumin -b1 -s60 'A:cp and A:eqsin' 'run-puppet-agent --enable "merging CR 1038881"' [14:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:09] (03PS2) 10Peter Fischer: Search update pipeline: enable saneitizer explicitly for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039736 [14:46:23] (03PS3) 10Peter Fischer: Search update pipeline: enable saneitizer explicitly for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039736 [14:49:11] (03PS1) 10Hnowlan: thumbor: upgrade staging to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039738 (https://phabricator.wikimedia.org/T336881) [14:49:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [14:49:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance [14:49:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T360332)', diff saved to https://phabricator.wikimedia.org/P64185 and previous config saved to /var/cache/conftool/dbconfig/20240606-144943-arnaudb.json [14:49:47] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [14:51:57] !log kill sessionstore pod running on mw1390.eqiad.wmnet (no dedicated='kask' taint) [14:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1209 (T360332)', diff saved to https://phabricator.wikimedia.org/P64186 and previous config saved to /var/cache/conftool/dbconfig/20240606-145205-arnaudb.json [14:54:13] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9867690 (10Scott_French) Thanks for taking a look, @hnowlan! Agreed, yeah - I was initially suspicious of a networking issue, but after verifying that... [14:54:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T352010)', diff saved to https://phabricator.wikimedia.org/P64187 and previous config saved to /var/cache/conftool/dbconfig/20240606-145440-ladsgroup.json [14:54:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:54:57] Oh, whoops, never pressed return. [14:55:21] (03CR) 10Ladsgroup: [C:03+1] thumbor: upgrade staging to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039738 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:55:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:55] (03CR) 10Hnowlan: [C:03+2] thumbor: upgrade staging to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039738 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:56:10] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1038828|Add wikilambda-edit-monolingual-text-placeholder message to extension.json (T359782)]] [14:56:14] T359782: Make ZMonolingualString's hard-coded English placeholder label for "Enter text" proper i18n - https://phabricator.wikimedia.org/T359782 [14:56:47] (03Merged) 10jenkins-bot: thumbor: upgrade staging to bullseye [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1039738 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [14:56:52] (03CR) 10Brouberol: [C:03+1] MPIC chart: Fixed a typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039725 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [14:56:57] !log disable ssw1-f1-eqiad leaf-facing ports in advance of upgrade T366361 [14:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:00] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [14:58:23] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:58:31] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:58:38] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1038828|Add wikilambda-edit-monolingual-text-placeholder message to extension.json (T359782)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:59:12] (03PS1) 10Ilias Sarantopoulos: profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) [14:59:22] RESOLVED: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [14:59:32] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on 15 hosts with reason: upgrading spine switches eqiad rows e and f [14:59:33] (03CR) 10CI reject: [V:04-1] profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [14:59:34] !log jforrester@deploy1002 jforrester: Continuing with sync [14:59:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on 15 hosts with reason: 
upgrading spine switches eqiad rows e and f [14:59:57] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867739 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e84998aa-eea9-43ce-9047-23b408d134b5) set by cmooney@cumin1002 for 1:30:00 on 15 host(s) and thei... [15:00:04] dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1500). [15:00:31] (03PS2) 10Ilias Sarantopoulos: profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) [15:00:57] (03PS1) 10Majavah: cr-labs: Allow cloudidm-dev hosts to talk to codfw1dev LDAP [homer/public] - 10https://gerrit.wikimedia.org/r/1039742 [15:01:42] Minor worry that my deploy of T362494 has caused T366809 (cc anzx) [15:01:43] T362494: Enable numerical category sorting on Commons - https://phabricator.wikimedia.org/T362494 [15:01:43] T366809: Category pagination broken on Commons - https://phabricator.wikimedia.org/T366809 [15:02:30] Hmm, could do. [15:02:52] In general, category changes are very expensive, and exceptionally-so for Commons. [15:03:10] (03CR) 10Santiago Faci: [C:03+2] MPIC chart: Fixed a typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039725 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [15:03:22] Maybe revert? 
[15:03:23] (03PS1) 10Hnowlan: thumbor: use bullseye version everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039743 (https://phabricator.wikimedia.org/T336881) [15:03:40] PROBLEM - BGP status on ssw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:03:45] only ever ran `mwscript updateCollation.php --wiki commonswiki --dry-run` though (and even then, for maybe a minute before the comment to not run that) [15:03:56] (03CR) 10Elukey: profile::services_proxy::envoy: add inference-staging to listeners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [15:03:56] PROBLEM - BFD status on ssw1-e1-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:04:01] Yeah, but every edit/re-parse will use the new code. [15:04:08] Whereas the 100M other pages will be on the old collation. [15:04:09] oh yes [15:04:10] (03Merged) 10jenkins-bot: MPIC chart: Fixed a typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039725 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [15:04:16] Hence the "fun". :-( [15:04:20] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:30:00 on ssw1-e1-eqiad.mgmt with reason: upgrading spine switches eqiad rows e and f [15:04:32] Commons really shouldn't be MW-based, as nothing we have copes at its scale.
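(Editorial aside.) The point made just above, that re-parsed pages pick up the new collation while the remaining pages keep sort keys from the old one, can be illustrated with a toy sketch. Everything here is an assumption for illustration: `old_key` and `numeric_key` are hypothetical stand-ins, not MediaWiki's actual collation code, and the titles are made up. The sketch only shows that a listing sorted on stored keys stops matching either collation once the keys are mixed.

```python
import re


def old_key(title: str) -> str:
    # Hypothetical stand-in for the previous collation: plain uppercase comparison.
    return title.upper()


def numeric_key(title: str) -> str:
    # Hypothetical stand-in for a numeric collation: zero-pad digit runs
    # so that "2" sorts before "10".
    return re.sub(r"\d+", lambda m: m.group().zfill(8), title.upper())


titles = ["File 1", "File 2", "File 10", "File 20"]

# With consistent keys, each collation yields a coherent order.
by_old = sorted(titles, key=old_key)
by_numeric = sorted(titles, key=numeric_key)

# Mixed state: only "File 2" was re-parsed after the config change, so its
# stored key uses the new collation while the others keep the old one.
reparsed = {"File 2"}
stored = {t: (numeric_key(t) if t in reparsed else old_key(t)) for t in titles}
mixed = sorted(titles, key=stored.__getitem__)

print("old:    ", by_old)
print("numeric:", by_numeric)
print("mixed:  ", mixed)
```

In this toy model the mixed listing puts "File 2" ahead of "File 1", an order neither collation would produce on its own, which is consistent with the concern that a partial rollout leaves category listings inconsistent until every stored key is regenerated or the setting is reverted.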
[15:04:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on ssw1-e1-eqiad.mgmt with reason: upgrading spine switches eqiad rows e and f [15:04:37] (03CR) 10Ladsgroup: [C:03+1] thumbor: use bullseye version everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039743 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [15:04:43] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9867757 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8ea52962-5718-4917-aeee-12b979b25d42) set by cmooney@cumin1002 for 1:30:00 on 1 host(s) and their... [15:04:46] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1021.eqiad.wmnet are marked down but pooled: druid-public-broker_8082: Servers druid1009.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1019.eqiad.wmnet are marked down but pooled: swift_80: Servers ms-fe1009.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1019.eqiad.wmnet are mark [15:04:46] ut pooled https://wikitech.wikimedia.org/wiki/PyBal [15:04:51] (03PS1) 10Jforrester: Revert "commonswiki: Enable numeric wgCategoryCollation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039746 (https://phabricator.wikimedia.org/T366809) [15:05:01] i fear that even after reverting there will be a bunch of pages that were saved with the current value and then would be broken [15:05:17] Sure, but a few tens of thousands vs. millions. [15:05:23] yep [15:05:24] Greatest good for the greatest number. [15:05:53] And categories are a hack that aren't guaranteed to be working, as ever. (Though they're amazingly much better since Tim and others improved them over the past few years.) [15:06:11] TheresNoTime: Want me to push out the revert? 
[15:06:17] (03CR) 10Majavah: [C:03+1] Revert "commonswiki: Enable numeric wgCategoryCollation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039746 (https://phabricator.wikimedia.org/T366809) (owner: 10Jforrester) [15:06:21] (03CR) 10Hnowlan: [C:03+2] thumbor: use bullseye version everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039743 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [15:06:35] James_F: i.e deploy that patch ^ ? [15:06:39] Yeah. [15:06:58] James_F: yes please, better safe than sorry I suppose [15:07:14] (03Merged) 10jenkins-bot: thumbor: use bullseye version everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039743 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [15:07:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P64188 and previous config saved to /var/cache/conftool/dbconfig/20240606-150714-arnaudb.json [15:07:18] OK, have to twiddle thumbs and wait for scap to finish the fpm-restart. [15:07:42] (03CR) 10Jforrester: [C:03+2] Revert "commonswiki: Enable numeric wgCategoryCollation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039746 (https://phabricator.wikimedia.org/T366809) (owner: 10Jforrester) [15:07:51] I can at least have the CI bit happen in parallel. 
[15:08:14] (03PS1) 10Scott French: data-gateway: add initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039744 (https://phabricator.wikimedia.org/T364921) [15:08:15] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1038828|Add wikilambda-edit-monolingual-text-placeholder message to extension.json (T359782)]] (duration: 12m 05s) [15:08:19] T359782: Make ZMonolingualString's hard-coded English placeholder label for "Enter text" proper i18n - https://phabricator.wikimedia.org/T359782 [15:08:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039746 (https://phabricator.wikimedia.org/T366809) (owner: 10Jforrester) [15:08:33] (03Merged) 10jenkins-bot: Revert "commonswiki: Enable numeric wgCategoryCollation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039746 (https://phabricator.wikimedia.org/T366809) (owner: 10Jforrester) [15:09:03] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1039746|Revert "commonswiki: Enable numeric wgCategoryCollation" (T366809)]] [15:09:06] T366809: Category pagination broken on Commons - https://phabricator.wikimedia.org/T366809 [15:09:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P64189 and previous config saved to /var/cache/conftool/dbconfig/20240606-150948-ladsgroup.json [15:10:32] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:10:42] (03PS1) 10Santiago Faci: MPIC chart: bumping a new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039745 (https://phabricator.wikimedia.org/T362642) [15:11:34] !log jforrester@deploy1002 jforrester: Backport for [[gerrit:1039746|Revert "commonswiki: Enable numeric wgCategoryCollation" (T366809)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:11:38] (03CR) 10Brouberol: 
[C:03+1] MPIC chart: bumping a new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039745 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [15:11:56] RECOVERY - BFD status on ssw1-e1-eqiad.mgmt is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:12:13] (03CR) 10Santiago Faci: [C:03+2] MPIC chart: bumping a new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039745 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [15:12:20] !log jforrester@deploy1002 jforrester: Continuing with sync [15:12:40] RECOVERY - BGP status on ssw1-e1-eqiad.mgmt is OK: BGP OK - up: 16, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:46] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:12:59] (03PS1) 10Aklapper: AVA: Check earlier if acting user is admin [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039766 (https://phabricator.wikimedia.org/T366811) [15:13:11] (03Merged) 10jenkins-bot: MPIC chart: bumping a new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039745 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [15:14:01] (03PS1) 10Jforrester: Add a note that you cannot change wgCategoryCollation easily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) [15:14:27] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039590 [15:14:38] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:14:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P64190 and previous config saved to /var/cache/conftool/dbconfig/20240606-151440-ladsgroup.json 
[15:14:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:14:58] New rule: never touch commons /s [15:15:08] Indeed. :-( [15:15:23] Hmm. [15:15:34] Canary checks failed. [15:16:07] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:16:18] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:16:34] On a third retry they passed. [15:16:38] :shrugs: [15:17:41] (03PS3) 10Ilias Sarantopoulos: profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) [15:17:59] (03CR) 10Alexandros Kosiaris: [C:03+1] thumbor: upgrade staging to bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039738 (https://phabricator.wikimedia.org/T336881) (owner: 10Hnowlan) [15:18:31] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "moved wikikube-ctrl1001 to a new rack - kamila@cumin1002 - T366204" [15:18:35] T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204 [15:19:20] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:21:16] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9867836 (10Scott_French) I just had a very interesting conversation with @Sfaci about the initialDelaySeconds recently added to AQS 2.0 services. In s... 
[15:21:53] (03PS4) 10Ilias Sarantopoulos: profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) [15:22:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P64191 and previous config saved to /var/cache/conftool/dbconfig/20240606-152222-arnaudb.json [15:22:54] (03PS5) 10Ilias Sarantopoulos: profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) [15:23:02] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1039746|Revert "commonswiki: Enable numeric wgCategoryCollation" (T366809)]] (duration: 13m 58s) [15:23:05] T366809: Category pagination broken on Commons - https://phabricator.wikimedia.org/T366809 [15:23:07] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:23:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "moved wikikube-ctrl1001 to a new rack - kamila@cumin1002 - T366204" [15:23:33] (03CR) 10Hnowlan: [C:03+1] data-gateway: add initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039744 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:23:47] (03CR) 10Elukey: profile::services_proxy::envoy: add inference-staging to listeners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [15:24:48] (03CR) 10Elukey: [C:03+1] "LGTM! 
Added some folks from serviceops as well to validate :)" [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [15:24:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P64192 and previous config saved to /var/cache/conftool/dbconfig/20240606-152456-ladsgroup.json [15:25:49] (03CR) 10Scott French: "Thanks, Hugh!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039744 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:25:54] (03CR) 10Scott French: [C:03+2] data-gateway: add initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039744 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:26:50] (03Merged) 10jenkins-bot: data-gateway: add initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039744 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:27:11] Gerges: If you have any information on how or when https://schedule-deployment.toolforge.org/ failed for you I would love to hear more. I just did a live test using a random change of yours and things went fine, but there very certainly may be bugs to fix. 
[15:27:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T352010)', diff saved to https://phabricator.wikimedia.org/P64193 and previous config saved to /var/cache/conftool/dbconfig/20240606-152747-ladsgroup.json [15:27:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:29:13] bd808: I think there was a conflict with one of the tools I was using on my browser [15:29:38] (03CR) 10Ilias Sarantopoulos: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [15:29:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P64194 and previous config saved to /var/cache/conftool/dbconfig/20240606-152949-ladsgroup.json [15:29:49] !log rebooting ssw1-f1-eqiad to install new JunOS release T366361 [15:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [15:30:02] (03Abandoned) 10Majavah: cr-labs: Allow cloudidm-dev hosts to talk to codfw1dev LDAP [homer/public] - 10https://gerrit.wikimedia.org/r/1039742 (owner: 10Majavah) [15:32:02] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [15:32:26] Gerges: if you have more details I would be happy to look deeper. 
[15:32:36] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1039697 (owner: 10Ayounsi) [15:32:56] (03CR) 10Volans: [C:03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1039700 (owner: 10Ayounsi) [15:33:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:35:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 210, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:35:46] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1021.eqiad.wmnet are marked down but pooled: druid-public-broker_8082: Servers druid1011.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1020.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1020.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:36:00] ^ know? 
[15:36:02] n [15:36:23] (03PS1) 10Ilias Sarantopoulos: ml-services: enable multiprocessing for eswiki-damaging and viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039776 (https://phabricator.wikimedia.org/T349274) [15:37:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T360332)', diff saved to https://phabricator.wikimedia.org/P64195 and previous config saved to /var/cache/conftool/dbconfig/20240606-153730-arnaudb.json [15:37:34] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [15:37:43] (03PS1) 10Hnowlan: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) [15:37:49] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [15:38:01] (03PS2) 10Hnowlan: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) [15:38:06] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [15:38:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:52] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [15:39:59] inflatador: ryankemper: something going on with wdqs1020.eqiad.wmnet and wdqs1021.eqiad.wmnet ? 
[15:40:00] sukhe :eyes on the wdqs hosts [15:40:03] ack [15:40:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T352010)', diff saved to https://phabricator.wikimedia.org/P64196 and previous config saved to /var/cache/conftool/dbconfig/20240606-154004-ladsgroup.json [15:40:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:40:07] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [15:40:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:40:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:40:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64197 and previous config saved to /var/cache/conftool/dbconfig/20240606-154028-ladsgroup.json [15:40:36] (03CR) 10Jforrester: "Done enough by Ica8b989fad1669c10cbc9f7bfe614566f666f99f?" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1038256 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [15:41:17] inflatador, claime: the lvs1020 alerts for wdqs can be ignored [15:41:29] (03CR) 10Clément Goubert: "yep" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1038256 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [15:41:31] I omitted to downtime that host prior to reload of ssw1-f1-eqiad [15:41:36] (03Abandoned) 10Clément Goubert: Empty commit to trigger rebuild [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1038256 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [15:41:37] topranks: ah, thanks, matches up [15:41:37] it's the backup lvs though so not active [15:41:38] (03CR) 10EoghanGaffney: [C:03+1] "Good changes, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) (owner: 10Jelto) [15:41:41] hmm [15:41:41] topranks thanks for the heads-up ... just saw that the hosts are in row E and F ;) [15:42:07] (03CR) 10CI reject: [V:04-1] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:42:07] inflatador: what I am intrigued about here is that we have an alert cos the LVS lost L2 reachability to the backends [15:42:20] the switch is rebooting, they'll clear shortly [15:42:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P64198 and previous config saved to /var/cache/conftool/dbconfig/20240606-154255-ladsgroup.json [15:42:57] topranks ah, I didn't get that from the above alerts. At least in the elastic case, pybal just failed silently and didn't depool the hosts [15:42:57] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [15:43:29] inflatador: indeed yeah, I wonder if the healthcheck is doing something different for wdqs instead [15:44:00] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [15:44:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P64199 and previous config saved to /var/cache/conftool/dbconfig/20240606-154457-ladsgroup.json [15:45:12] inflatador: so it seems the vlan interface on lvs1020 has stayed in state "up" [15:45:17] (although shows 'no carrier') [15:45:45] which means connectivity to e/f is still trying to use the direct link, it's not re-routing via the LVS primary link as we saw with the elastic incident [15:45:56] in the elastic incident the vlan interfaces didn't exist though hmm [15:45:59] I don't think I've ever seen that for a VLAN interface 
[15:46:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:49:04] inflatador: seems it set the 'no carrier' flag on both the vlan and physical port, but didn't actually go to state DOWN [15:49:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:49:08] https://phabricator.wikimedia.org/P64200 [15:49:10] cool :) [15:49:55] Ah OK, that makes sense if the lower layer was down too [15:52:25] (03CR) 10Ayounsi: [C:03+2] rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 (owner: 10Ayounsi) [15:56:19] (03Merged) 10jenkins-bot: rename: check for iDRAC version first [cookbooks] - 10https://gerrit.wikimedia.org/r/1039732 (owner: 10Ayounsi) [15:58:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P64201 and previous config saved to /var/cache/conftool/dbconfig/20240606-155804-ladsgroup.json [15:59:20] (03CR) 10Volans: "Much better, almost there. Few minor fixes inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [16:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P64202 and previous config saved to /var/cache/conftool/dbconfig/20240606-160004-ladsgroup.json [16:00:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [16:00:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:00:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [16:00:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P64203 and previous config saved to /var/cache/conftool/dbconfig/20240606-160028-ladsgroup.json [16:02:50] (03CR) 10Clément Goubert: "Missing the egress networkpolicy in `helmfile.d/services/mw-debug/values.yaml`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:03:33] (03PS7) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [16:06:12] FIRING: [2x] ProbeDown: Service wikikube-ctrl1001:6443 has failed probes (http_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wikikube-ctrl1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:07:02] kamila_: this you ^? 
[16:07:29] (03CR) 10Effie Mouzeli: "I appreciate the work you have done here, however I feel like we are overengineering and overcomplicating for 2 variables, while we do not" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto) [16:07:35] claime: maybe? XD [16:07:39] !incidents [16:07:40] 4726 (ACKED) [2x] ProbeDown sre (wikikube-ctrl1001:6443 probes/custom eqiad) [16:07:40] 4725 (RESOLVED) Primary inbound port utilisation over 80% (paged) global noc (cr1-eqiad.wikimedia.org) [16:07:51] downtime timed out or something [16:07:55] ack [16:08:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:08:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-ctrl1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:09:54] (03PS2) 10JHathaway: mw-debug: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) [16:10:16] (03CR) 10JHathaway: "thanks for catching, added." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:10:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: reimage still running [16:10:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on wikikube-ctrl1001.eqiad.wmnet with reason: reimage still running [16:11:04] claime: I downtimed it some more, sorry for the noise [16:11:11] (03CR) 10Scott French: "Thank you for the reviews. 
I'll go ahead and get this merged and tagged, and the packages built." [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [16:11:17] (03CR) 10Scott French: [C:03+2] Release 3.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [16:11:43] kamila_: no worries [16:12:08] (03PS1) 10Aklapper: Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) [16:13:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T352010)', diff saved to https://phabricator.wikimedia.org/P64204 and previous config saved to /var/cache/conftool/dbconfig/20240606-161312-ladsgroup.json [16:13:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:13:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:13:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:13:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P64205 and previous config saved to /var/cache/conftool/dbconfig/20240606-161338-ladsgroup.json [16:14:57] (03Merged) 10jenkins-bot: Release 3.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [16:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled 
[16:24:25] (03CR) 10Scott French: "Thank you both!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [16:24:34] !log dancy@deploy1002 Installing scap version "4.86.1" for 286 hosts [16:24:50] (03Abandoned) 10Scott French: similar-users: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037194 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [16:25:10] !log dancy@deploy1002 Installing scap version "4.86.1" for 285 hosts [16:25:49] !log dancy@deploy1002 Installation of scap version "4.86.1" completed for 285 hosts [16:25:52] hashar: Give it a try please! [16:26:45] Wrong channel. [16:26:54] trying [16:28:06] !log hashar@deploy1002 Started deploy [integration/docroot@eee90e6]: (no justification provided) [16:28:12] !log hashar@deploy1002 Finished deploy [integration/docroot@eee90e6]: (no justification provided) (duration: 00m 05s) [16:28:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T352010)', diff saved to https://phabricator.wikimedia.org/P64206 and previous config saved to /var/cache/conftool/dbconfig/20240606-162812-ladsgroup.json [16:28:16] magic [16:28:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:28:37] hi, is anyone else having trouble connecting to gerrit, or just me? [16:28:40] "Received disconnect from 2620:0:861:2:208:80:154:151 port 29418:12: Too many concurrent connections (8) - max. allowed: 8" [16:29:44] (i get that when i git pull or git push) [16:30:06] . 
[16:30:10] and i don't have any connections open… as far as i can tell [16:32:36] (03PS5) 10Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835) [16:32:36] (03PS5) 10Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835) [16:34:21] (03PS4) 10JMeybohm: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) [16:34:21] (03PS1) 10JMeybohm: Fix fixture generation for upstream splits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039788 (https://phabricator.wikimedia.org/T350846) [16:37:06] (03PS2) 10Aklapper: Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) [16:37:12] MatmaRex: there have been 44 logins from your user on this day, somehow [16:37:52] I think it's just you, others appear to be uploading just fine [16:39:37] (03PS3) 10Aklapper: Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) [16:43:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P64207 and previous config saved to /var/cache/conftool/dbconfig/20240606-164320-ladsgroup.json [16:44:07] MatmaRex: I think I may have fixed that for you. 
try now [16:44:21] I was able to use special commands to close individual connections [16:44:32] (sorry brb, meeting) [16:45:49] well, I can confirm there were actually 8 connections for your user, and now i closed them all [16:46:32] it wasn't a global problem because the commands to close ssh connections are run over .. the same ssh daemon :p [16:46:55] (03PS1) 10Aklapper: Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) [16:50:27] !log disabling pybal on lvs1019 to move traffic to lvs1020 in advance of cable move T366361 [16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:31] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [16:51:56] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lvs1019.eqiad.wmnet with reason: moving lvs1019 link back to ssw1-f1-codfw [16:52:06] (03CR) 10Aklapper: [C:03+1] wikitech: Update Phabricator Conduit calls to disable/enable users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) (owner: 10BryanDavis) [16:52:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on lvs1019.eqiad.wmnet with reason: moving lvs1019 link back to ssw1-f1-codfw [16:58:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P64208 and previous config saved to /var/cache/conftool/dbconfig/20240606-165828-ladsgroup.json [17:00:05] bd808: #bothumor I ♥ Unicode. All rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1700). 
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1700) [17:02:53] (03CR) 10Scott French: [C:03+1] "LGTM! Good catch adding back the mesh annotations after the base.meta bump." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:04:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:06:34] (03PS1) 10Dzahn: peopleweb: fix wrong unit, bytes are actually kilobytes in a script [puppet] - 10https://gerrit.wikimedia.org/r/1039795 (https://phabricator.wikimedia.org/T343364) [17:08:21] (03PS2) 10Dzahn: peopleweb: fix wrong unit, bytes are actually kilobytes in a script [puppet] - 10https://gerrit.wikimedia.org/r/1039795 (https://phabricator.wikimedia.org/T343364) [17:08:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:19] !log re-enabling pybal on lvs1019 after cable move T366361 [17:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:23] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [17:11:47] !log disabling pybal on lvs1018 to move traffic to lvs1020 in advance of cable move T366361 [17:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T352010)', diff saved to https://phabricator.wikimedia.org/P64209 and previous config saved to 
/var/cache/conftool/dbconfig/20240606-171336-ladsgroup.json [17:13:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [17:13:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:13:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [17:14:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P64210 and previous config saved to /var/cache/conftool/dbconfig/20240606-171359-ladsgroup.json [17:14:12] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:14:20] ^ expected, [17:14:48] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:14:58] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lvs1019.eqiad.wmnet with reason: moving lvs1018 link back to ssw1-e1-codfw [17:14:59] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 0:20:00 on lvs1019.eqiad.wmnet with reason: moving lvs1018 link back to ssw1-e1-codfw [17:15:05] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lvs1018.eqiad.wmnet with reason: moving lvs1018 link back to ssw1-e1-codfw [17:15:19] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on lvs1018.eqiad.wmnet with reason: moving lvs1018 link back to ssw1-e1-codfw [17:15:38] sukhe: meh typo in my downtime that time sry [17:16:04] np at all. I don't think you can downtime all of these anyway, not the BGP ones for sure! 
[17:17:52] (03PS2) 10Kgraessle: InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038388 (https://phabricator.wikimedia.org/T362969) [17:23:43] !log re-enabling pybal on lvs1018 after cable move T366361 [17:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:46] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [17:23:48] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:23:52] nice job topranks! [17:24:02] just one last one :) [17:24:12] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:26:01] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on lvs1017.eqiad.wmnet with reason: moving lvs1017 link back to ssw1-e1-codfw [17:26:12] !log disabling pybal on lvs1017 to move traffic to lvs1020 in advance of cable move T366361 [17:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:15] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on lvs1017.eqiad.wmnet with reason: moving lvs1017 link back to ssw1-e1-codfw [17:31:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [17:31:01] (03CR) 10Dzahn: [C:03+2] peopleweb: fix wrong unit, bytes are actually kilobytes in a script [puppet] - 10https://gerrit.wikimedia.org/r/1039795 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [17:31:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance [17:31:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1247 (T364069)', diff saved to 
https://phabricator.wikimedia.org/P64211 and previous config saved to /var/cache/conftool/dbconfig/20240606-173121-marostegui.json [17:31:27] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:31:37] I won't be using my deploy window today [17:33:06] mutante: i'm back, and thank you, it's working. do you know if it was caused by something i did, or some bug on my side, or was it a bug on the gerrit side? [17:33:27] i haven't done anything out of ordinary today, just fetched and uploaded some changes with `git review` [17:33:46] hmm, i guess i cloned a repo [17:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:36:38] MatmaRex: glad it worked. all I can really say is it was only your user, not other users. there were really a lot of logins from you today [17:37:01] and I don't recall having to do this before [17:38:03] weird. 
thanks, i guess i'll start worrying if it happens again [17:41:59] (03PS1) 10Andrew Bogott: Keystone: map toolsbeta groups and users to keystone groups and users [puppet] - 10https://gerrit.wikimedia.org/r/1039799 (https://phabricator.wikimedia.org/T358496) [17:46:11] (03PS2) 10Andrew Bogott: Keystone: map toolsbeta groups and users to keystone groups and users [puppet] - 10https://gerrit.wikimedia.org/r/1039799 (https://phabricator.wikimedia.org/T358496) [17:46:14] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:46:19] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039799 (https://phabricator.wikimedia.org/T358496) (owner: 10Andrew Bogott) [17:46:22] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:46:48] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:48:30] !log re-enabling pybal on lvs1017 after cable move T366361 [17:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:34] T366361: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361 [17:48:48] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:49:14] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:51:22] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:56:21] !log kamila@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - kamila@cumin1002" [17:57:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - kamila@cumin1002" [17:57:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [18:00:05] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T1800). [18:03:01] o/ [18:04:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:08:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:11:32] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9868580 (10ssingh) Moving the links working out well (which I think this is the first time?) is a big take away from this task; glad to hear it went nicely! 
[18:13:12] dancy: o/ [18:14:17] (03PS1) 10Dzahn: peopleweb: adjust wording in warning emails about large home dirs [puppet] - 10https://gerrit.wikimedia.org/r/1039805 (https://phabricator.wikimedia.org/T343364) [18:14:35] (03PS3) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) [18:16:25] (03Abandoned) 10Stoyofuku-wmf: Refine list of pages where font size controls are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039310 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [18:16:49] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039807 (https://phabricator.wikimedia.org/T361402) [18:16:51] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039807 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:17:09] !log thcipriani@deploy1002 Started deploy [releng/jenkins-deploy@3be9893] (releasing): (no justification provided) [18:17:48] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039807 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:17:52] !log thcipriani@deploy1002 Finished deploy [releng/jenkins-deploy@3be9893] (releasing): (no justification provided) (duration: 00m 43s) [18:27:25] (03PS1) 10Pmiazga: beta: introduce pl.wikivoyage.beta.wmcloud.org wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) [18:29:19] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.8 refs T361402 [18:29:20] (03CR) 10Jdlrobson: [C:03+1] Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 
(https://phabricator.wikimedia.org/T366625) (owner: 10Stoyofuku-wmf) [18:29:24] T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402 [18:30:07] (03CR) 10Dzahn: [C:03+2] peopleweb: adjust wording in warning emails about large home dirs [puppet] - 10https://gerrit.wikimedia.org/r/1039805 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [18:34:32] (03PS1) 10Dzahn: peopleweb: fix a variable name, minor formatting [puppet] - 10https://gerrit.wikimedia.org/r/1039811 (https://phabricator.wikimedia.org/T343364) [18:34:43] (03CR) 10CI reject: [V:04-1] peopleweb: fix a variable name, minor formatting [puppet] - 10https://gerrit.wikimedia.org/r/1039811 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [18:38:47] (03CR) 10Scott French: [C:03+1] "Thanks, Janis!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039788 (https://phabricator.wikimedia.org/T350846) (owner: 10JMeybohm) [18:38:58] (03PS2) 10Dzahn: peopleweb: fix a variable name, minor formatting [puppet] - 10https://gerrit.wikimedia.org/r/1039811 (https://phabricator.wikimedia.org/T343364) [18:38:58] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 5/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:39:04] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:39:20] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:39:22] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 7/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:39:22] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:39:58] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:40:04] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:40:17] (03PS1) 10JHathaway: jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1039812 [18:40:22] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:40:22] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:42:46] (03CR) 10JHathaway: [C:03+2] jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1039812 (owner: 10JHathaway) [18:45:01] (03PS1) 10Eevans: cassandra-dev2001: upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1039813 (https://phabricator.wikimedia.org/T350567) [18:45:59] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039813 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [18:48:43] (03CR) 10Ebrahim: commonswiki: Enable numeric wgCategoryCollation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037006 (https://phabricator.wikimedia.org/T362494) (owner: 10Anzx) [18:49:11] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9868739 (10wiki_willy) Ok, got it. Thanks for the info @dcaro. And just to confirm, cloudcephosd1001-1020 have the... 
[18:49:49] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9868742 (10kamila) @VRiley-WMF the reimage of wikikube-ctrl1001 was finally successful, I want to run a few more tests due to hav... [18:50:14] (03CR) 10Gergő Tisza: [C:03+1] beta: introduce pl.wikivoyage.beta.wmcloud.org wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [18:51:06] train seems ok. i am noticing an uptick in lock wait timeouts on wikidata but it seems congruent with rates for the same time of day over the past few days. no uptick in slow queries or anything else standing out. anyway, calling it a train [18:52:17] \o/ [18:55:23] (03CR) 10Andrew Bogott: [C:03+2] Keystone: map toolsbeta groups and users to keystone groups and users [puppet] - 10https://gerrit.wikimedia.org/r/1039799 (https://phabricator.wikimedia.org/T358496) (owner: 10Andrew Bogott) [18:56:16] (03CR) 10Scott French: "I should probably also mention: the patch-version bump in mesh.certificate also drops the currently unused cert-manager Certificate object" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [19:01:43] (03PS1) 10CDobbins: purged: set use_pki to true in magru [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) [19:04:18] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:04:53] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2791/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:11:20] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:13:16] (03PS2) 10Jsn.sherman: InitialiseSettings: Enable AutoModerator on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038886 (https://phabricator.wikimedia.org/T362622) [19:16:24] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:18:11] (03CR) 10Ssingh: [C:03+1] "Looks good! Let's merge tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:18:23] (03CR) 10Dzahn: [C:03+2] peopleweb: fix a variable name, minor formatting [puppet] - 10https://gerrit.wikimedia.org/r/1039811 (https://phabricator.wikimedia.org/T343364) (owner: 10Dzahn) [19:18:33] (03CR) 10Ssingh: [C:03+1] "Rather, Monday, sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/1039815 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [19:30:49] !log xcollazo@deploy1002 Started deploy [airflow-dags/analytics@a8843e6]: Deploying latest DAGs to the analytics Airflow instance. T358707. [19:30:53] T358707: [Commons Impact Metrics] Create Airflow job that formats and loads the data to Cassandra for AQS - https://phabricator.wikimedia.org/T358707 [19:31:16] !log xcollazo@deploy1002 Finished deploy [airflow-dags/analytics@a8843e6]: Deploying latest DAGs to the analytics Airflow instance. T358707. 
(duration: 00m 26s) [19:36:21] (03CR) 10Ladsgroup: [C:03+1] Add a note that you cannot change wgCategoryCollation easily [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039767 (https://phabricator.wikimedia.org/T362494) (owner: 10Jforrester) [19:38:06] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9868908 (10Ladsgroup) I talked to some people at community resources and they said this should be first approved by affcom. Sorry. Let me contact them. [19:38:20] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:38:48] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1039813 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [19:39:43] (03PS1) 10Pmiazga: beta: Add server_alias for wikivoyage.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) [19:41:03] (03CR) 10Gergő Tisza: [C:03+1] beta: Add server_alias for wikivoyage.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [19:43:20] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 21 probes of 790 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [19:44:09] (03CR) 10Ladsgroup: [C:03+1] "I'll merge this around Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [19:47:32] 06SRE, 10Cassandra, 06Data-Persistence, 13Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9868944 (10Eevans) cassandra-dev2001-{a,b} have been upgraded to Java 11 (canaries). [19:50:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:31] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9868967 (10Ladsgroup) Clinic duty note: If I'm not mistaken this needs to be approved by a sponsor in WMF. Correct? [19:53:47] (03PS1) 10Scott French: data-gateway: bump image version to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039823 (https://phabricator.wikimedia.org/T364921) [19:55:49] (03CR) 10Krinkle: rpc: Update function call in RunSingleJob (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) (owner: 10Ladsgroup) [19:58:47] (03CR) 10Eevans: [C:03+1] data-gateway: bump image version to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039823 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T2000). [20:00:05] katherine_g, JSherman, pppery, Gerges, and toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:11] Here [20:00:13] i can deploy today [20:00:17] howdy [20:00:18] Hi [20:00:19] here [20:00:19] a lot of patches! [20:00:50] Jan's gonna deploy my patch so I can shadow him, but we're last so [20:00:56] Will be lurking until then [20:01:03] toyofuku: ack [20:01:04] (03CR) 10Scott French: [C:03+2] data-gateway: bump image version to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039823 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:01:29] (03CR) 10Urbanecm: [C:03+2] InitialiseSettings: Enable AutoModerator on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038886 (https://phabricator.wikimedia.org/T362622) (owner: 10Jsn.sherman) [20:01:36] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038388 waiting on beta to be up to date, so this can be skipped [20:01:43] (03PS4) 10Wargo: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) [20:01:56] (03Merged) 10jenkins-bot: data-gateway: bump image version to v1.0.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039823 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:02:07] (03Merged) 10jenkins-bot: InitialiseSettings: Enable AutoModerator on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038886 (https://phabricator.wikimedia.org/T362622) (owner: 10Jsn.sherman) [20:02:12] katherine_g: fwiw, i can get that out if you want me to [20:02:16] (03CR) 10Urbanecm: [C:03+2] $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) (owner: 10Wargo) [20:02:20] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 
[20:02:20] (03CR) 10CI reject: [V:04-1] $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) (owner: 10Wargo) [20:02:27] (03PS4) 10Wargo: Assign applychangetags right to group "all" on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031174 (https://phabricator.wikimedia.org/T363638) [20:02:30] (03CR) 10Urbanecm: [C:03+2] Assign applychangetags right to group "all" on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031174 (https://phabricator.wikimedia.org/T363638) (owner: 10Wargo) [20:02:40] That is a massive throttle o_0 [20:02:40] urbanecm: sure sounds good [20:02:54] (03CR) 10CI reject: [V:04-1] $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) (owner: 10Wargo) [20:02:59] 300 on a 24 and 2 /29, on two wikis... for 3.5 months? [20:03:01] Reedy: oh... it's a couple of months [20:03:06] i missed the difference there [20:03:12] Pppery: yeah...why is that needed, please? [20:03:17] (03Merged) 10jenkins-bot: Assign applychangetags right to group "all" on plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031174 (https://phabricator.wikimedia.org/T363638) (owner: 10Wargo) [20:03:39] (03PS3) 10Kgraessle: InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038388 (https://phabricator.wikimedia.org/T362969) [20:03:39] 272 IPs? 
[20:03:47] (03CR) 10Urbanecm: [C:03+2] InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038388 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:03:51] I was just trying to get some action on the very old task as it should have been dealt with a while ago. If you want to decline that instead then feel free [20:04:08] It's very likely that all those IPs won't actually be used (based on how many networks are built), but still [20:04:25] (03Merged) 10jenkins-bot: InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038388 (https://phabricator.wikimedia.org/T362969) (owner: 10Kgraessle) [20:06:11] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1031174|Assign applychangetags right to group "all" on plwiktionary (T363638)]], [[gerrit:1038886|InitialiseSettings: Enable AutoModerator on trwiki (T362622)]], [[gerrit:1038388|InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki (T362969)]] [20:06:19] T363638: Assign applychangetags right to group "all" on plwiktionary - https://phabricator.wikimedia.org/T363638 [20:06:19] T362622: Enable AutoModerator on tr.wiki - https://phabricator.wikimedia.org/T362622 [20:06:19] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:06:41] The experience where a somewhat time-sensitive task is filed, it gets no response for a month despite a request for a response, and then a patch is submitted, and then it gets no response for weeks is just very wrong. That's what motivated me to get involved with backport windows, by the way. 
[20:07:14] (03CR) 10Urbanecm: [C:04-2] "this is a very large exception; needs an explanation of why is it actually needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) (owner: 10Wargo) [20:07:40] Pppery: i agree, it is definitely non-ideal. [20:07:56] (i'll comment on the task as well) [20:08:39] !log urbanecm@deploy1002 wargo and urbanecm and jsn and kgraessle: Backport for [[gerrit:1031174|Assign applychangetags right to group "all" on plwiktionary (T363638)]], [[gerrit:1038886|InitialiseSettings: Enable AutoModerator on trwiki (T362622)]], [[gerrit:1038388|InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki (T362969)]] synced to the testservers (https://wikitech.wikimedia.org/wiki [20:08:39] /Mwdebug) [20:09:02] JSherman: Pppery: can you test at mwdebug, please? [20:09:09] katherine_g: your beta patch'll be deployed, once beta updates itself [20:09:26] urbanecm: ack [20:10:31] My patch appears to work correctly - I can confirm "applychangetags" shows up at the top of https://pl.wiktionary.org/wiki/Specjalna:Grupy_u%C5%BCytkownik%C3%B3w?uselang=en with X-Wikimedia-Debug on and doesn't with it off [20:11:10] urbanecm: ack [20:11:18] thanks Pppery [20:11:25] waiting for JSherman [20:11:34] urbanecm: just finished; looks good [20:11:37] ty [20:11:52] !log urbanecm@deploy1002 wargo and urbanecm and jsn and kgraessle: Continuing with sync [20:12:05] (03PS1) 10Scott French: data-gateway: bump image version to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039829 (https://phabricator.wikimedia.org/T364921) [20:12:07] (03PS4) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) [20:13:29] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [20:13:41] !log swfrench@deploy1002 helmfile [staging] DONE
helmfile.d/services/data-gateway: apply [20:14:56] (03PS2) 10GergesShamon: [mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039729 (https://phabricator.wikimedia.org/T366549) [20:15:04] (03CR) 10Urbanecm: [C:03+2] [mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039729 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [20:15:31] (03CR) 10Urbanecm: [C:03+2] Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519) (owner: 10Urbanecm) [20:15:34] (03CR) 10Eevans: [C:03+1] data-gateway: bump image version to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039829 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:15:35] (03CR) 10Urbanecm: [C:03+2] Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) (owner: 10Sergio Gimeno) [20:15:42] (03Merged) 10jenkins-bot: [mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039729 (https://phabricator.wikimedia.org/T366549) (owner: 10GergesShamon) [20:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [20:16:25] FIRING: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:18:10] (03Merged) 10jenkins-bot: Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519) (owner: 10Urbanecm) [20:18:15] (03Merged) 10jenkins-bot: Improve navigation link handling in CommunityConfiguration [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038843 (https://phabricator.wikimedia.org/T364938) (owner: 10Sergio Gimeno) [20:18:29] not enough workers. that does not sound particularly good [20:18:49] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:19:19] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:20:21] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1031174|Assign applychangetags right to group "all" on plwiktionary (T363638)]], [[gerrit:1038886|InitialiseSettings: Enable AutoModerator on trwiki (T362622)]], [[gerrit:1038388|InitaliseSettings-labs: Deploy Automoderator patroller workstream survey to cawiki (T362969)]] (duration: 14m 10s) [20:20:28] T363638: Assign applychangetags right to group "all" on plwiktionary - https://phabricator.wikimedia.org/T363638 [20:20:29] T362622: Enable AutoModerator on tr.wiki - https://phabricator.wikimedia.org/T362622 [20:20:29] T362969: Deploy QuickSurvey for Automoderator patroller workstream survey - https://phabricator.wikimedia.org/T362969 [20:20:37] deployed first batch of patches [20:20:55] !log 
swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [20:21:00] thanks! [20:21:25] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:33] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [20:21:58] (03CR) 10Scott French: [C:03+2] data-gateway: bump image version to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039829 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:22:12] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1039729|[mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" (T366549)]], [[gerrit:1038843|Improve navigation link handling in CommunityConfiguration (T364938 T365504 T360954)]], [[gerrit:1038714|Drop logging level for unsupported providers to DEBUG (T366519 T360954)]] [20:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.34% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:22:25] T366549: Rename namespace "Wiktionary" to "Wikikamus" in ms.wiktionary.org - https://phabricator.wikimedia.org/T366549 [20:22:25] T364938: CommunityConfiguration edit links appear at unrelated pages - https://phabricator.wikimedia.org/T364938 [20:22:28] T365504: Disable Actions for Special:CommunityConfiguration edit form - https://phabricator.wikimedia.org/T365504 [20:22:30] T360954: Deploy CommunityConfiguration to testwiki - https://phabricator.wikimedia.org/T360954 [20:22:30] T366519: CommunityConfiguration: 
WikiPageConfigReader should not log WARNINGs when encountering a provider it does not support - https://phabricator.wikimedia.org/T366519 [20:22:47] (03Merged) 10jenkins-bot: data-gateway: bump image version to v1.0.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039829 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [20:24:30] !log urbanecm@deploy1002 urbanecm and sgimeno and gergesshamon: Backport for [[gerrit:1039729|[mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" (T366549)]], [[gerrit:1038843|Improve navigation link handling in CommunityConfiguration (T364938 T365504 T360954)]], [[gerrit:1038714|Drop logging level for unsupported providers to DEBUG (T366519 T360954)]] synced to the testservers (https://wikitech.wikimedia.org/wiki [20:24:30] /Mwdebug) [20:24:41] Pppery: can you test your patch, please? [20:24:50] What patch [20:25:07] sorry [20:25:10] Gerges: [20:25:15] can you test the namespace patch please? [20:25:22] wrong person, my apologies [20:26:00] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [20:26:14] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [20:27:15] urbanecm: I tested the patch, But pages in the old namespace are supposed to be moved to a new namespace [20:27:25] is that not happening? [20:27:36] No [20:27:53] Gerges: can you link an example, please? [20:28:21] https://ms.wiktionary.org/wiki/Wikikamus:Kedai_Kopi#Penukaran_ruang_nama_dan_tajuk_tab_%22Wiktionary%22_ke_%22Wikikamus%22 [20:28:50] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:29:25] Gerges: that page is in the Wikikamus namespace? [20:29:37] (it is also in the URL) [20:29:41] * urbanecm does not see the bug [20:30:25] Yes [20:30:38] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:30:54] Gerges: sorry, i am confused now. is the patch working as expected? 
[20:30:58] I NULLEDITed that page, which fixed the displayed title and the redlink but it still says Wiktionary in the sidebar tab. Is that the bug? [20:31:27] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [20:32:36] https://ms.m.wiktionary.org/wiki/Wikikamus:Kedai_Kopi#Penukaran_ruang_nama_dan_tajuk_tab_%22Wiktionary%22_ke_%22Wikikamus%22 [20:32:40] https://ms.m.wiktionary.org/w/index.php?title=Wikikamus:Kedai_Kopi [20:32:56] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [20:33:04] The patch now works fine [20:33:12] Probably my null edit helped [20:33:14] Pppery: good catch. i didn't see that because my interface language is set to english. https://ms.wiktionary.org/wiki/MediaWiki:Nstab-project is the cause [20:33:18] !log urbanecm@deploy1002 urbanecm and sgimeno and gergesshamon: Continuing with sync [20:33:22] anyway, let's proceed then :) [20:34:45] Should I edit the MediaWiki:Nstab-project? [20:35:18] You don't have the rights to, that needs a mswiktionary admin [20:35:27] But you can point out on-wiki that it needs to be done [20:37:15] toyofuku: would it be OK with you if i deploy one more patch of mine and then hand the window over to you? i also don't mind handing over now and getting back to my patch once you're done :). [20:37:23] Pppery: that form is outdated then by the looks, Reedy started taking them out of site-requests ages ago because maintenance script runs aren't site config changes anyway [20:37:39] urbanecm: go right ahead! [20:37:42] thanks!
[20:37:47] Happy to go last since I'm ~learning~ [20:37:49] (03PS5) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) [20:38:00] (03CR) 10Urbanecm: [C:03+2] testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [20:38:24] (context for p858snake's comment: T366825) [20:38:25] T366825: Request to move translatable page: Movement Strategy and Governance/Termbase/Table (2024) - https://phabricator.wikimedia.org/T366825 [20:38:46] (03Merged) 10jenkins-bot: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954) (owner: 10Urbanecm) [20:39:24] (03PS5) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) [20:39:56] Pppery: p858snake|cloud: form updated :) [20:41:54] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1039729|[mswiktionary] Rename namespace "Wiktionary" to "Wikikamus" (T366549)]], [[gerrit:1038843|Improve navigation link handling in CommunityConfiguration (T364938 T365504 T360954)]], [[gerrit:1038714|Drop logging level for unsupported providers to DEBUG (T366519 T360954)]] (duration: 19m 42s) [20:42:02] T366549: Rename namespace "Wiktionary" to "Wikikamus" in ms.wiktionary.org - https://phabricator.wikimedia.org/T366549 [20:42:02] T364938: CommunityConfiguration edit links appear at unrelated pages - https://phabricator.wikimedia.org/T364938 [20:42:03] T365504: Disable Actions for Special:CommunityConfiguration edit form - https://phabricator.wikimedia.org/T365504 [20:42:03] T360954: Deploy CommunityConfiguration to testwiki - https://phabricator.wikimedia.org/T360954 [20:42:04] T366519: CommunityConfiguration: 
WikiPageConfigReader should not log WARNINGs when encountering a provider it does not support - https://phabricator.wikimedia.org/T366519 [20:42:10] Gerges: deployed! [20:42:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038701|testwiki: Enable CommunityConfiguration (T360954)]] [20:44:40] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:1038701|testwiki: Enable CommunityConfiguration (T360954)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:45:08] urbanecm: Thanks [20:46:01] !log urbanecm@deploy1002 urbanecm: Continuing with sync [20:48:34] (03CR) 10Volans: "if you want we can totally merge this even without the new hosts for testing as long as the Dell part isn't affected. So that the diff lat" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [20:50:29] !log mwscript extensions/GrowthExperiments/maintenance/migrateCommunityConfig.php --wiki=testwiki # T360954 [20:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:33] T360954: Deploy CommunityConfiguration to testwiki - https://phabricator.wikimedia.org/T360954 [20:54:24] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038701|testwiki: Enable CommunityConfiguration (T360954)]] (duration: 12m 09s) [20:54:42] urbanecm: mine is looking good [20:54:47] katherine_g: thanks for the info! [20:54:53] and, done :) [20:55:06] toyofuku: over to you and Jan :) [20:55:11] Thank you!! [20:55:16] urbanecm: thanks! 
[20:56:18] np [20:56:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) (owner: 10Stoyofuku-wmf) [20:57:19] (03Merged) 10jenkins-bot: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) (owner: 10Stoyofuku-wmf) [20:57:36] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1038876|Disable font size options on specified pages for all wikis (T366625)]] [20:57:42] T366625: Force small font size for wikis with accessibility menu on for anonymous users - https://phabricator.wikimedia.org/T366625 [20:59:59] !log jdrewniak@deploy1002 jdrewniak and toyofuku: Backport for [[gerrit:1038876|Disable font size options on specified pages for all wikis (T366625)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:01:54] !log jdrewniak@deploy1002 jdrewniak and toyofuku: Continuing with sync [21:05:32] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:08:15] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9869298 (10Scott_French) Many thanks to @Eevans for humoring my experiments. The results are in, and it seems that upgrading from gocql v1.2.0 to v1.... 
[21:08:30] 06SRE, 10Cassandra, 06Data Products, 06serviceops, and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9869304 (10Scott_French) [21:10:27] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1038876|Disable font size options on specified pages for all wikis (T366625)]] (duration: 12m 50s) [21:10:30] T366625: Force small font size for wikis with accessibility menu on for anonymous users - https://phabricator.wikimedia.org/T366625 [21:11:14] Thank you all so much! [21:13:14] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869306 (10RobH) p:05Medium→03High @Jclark-ctr or @VRiley-WMF: Would one of you be able to take care of this on your next on-site visit? We have light on the drm... [21:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:17:22] (03PS1) 10Bking: dse-k8s: replace 'airflow-analytics-test' ns with 'airflow' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039838 (https://phabricator.wikimedia.org/T363001) [21:18:38] PROBLEM - Host elastic2099 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:21:40] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:21] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [21:32:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869358 (10wiki_willy) a:03Jclark-ctr [21:34:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869359 (10wiki_willy) Valerie is on vacation, so assigning to John [21:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:35:32] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:44:37] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869378 (10RobH) [21:45:41] (03CR) 10Ryan Kemper: [C:03+1] dse-k8s: replace 'airflow-analytics-test' ns with 'airflow' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039838 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [21:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:00] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for 
ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [21:46:55] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:48:14] (03CR) 10Bking: [C:03+2] dse-k8s: replace 'airflow-analytics-test' ns with 'airflow' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039838 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [21:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:50:03] is anyone around who would like to check on the status of a maintenance script for me? (not urgent) https://phabricator.wikimedia.org/T315510#9842466 [21:51:13] (03Merged) 10jenkins-bot: dse-k8s: replace 'airflow-analytics-test' ns with 'airflow' [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039838 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [21:51:14] MatmaRex: it has crashed at some point: https://phabricator.wikimedia.org/P64214 [21:51:44] grumble. thanks [21:54:44] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9869385 (10wiki_willy) @Papaul & @Jhancock.wm - was this one completed already via a different task? 
[21:56:40] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2082:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:55] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:40] RESOLVED: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:28] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9869394 (10Papaul) @wiki_willy yes @https://phabricator.wikimedia.org/T354685 [22:11:40] FIRING: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:12:55] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:13:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:13:46] PROBLEM - mailman archives on lists1001 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:15:11] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9869411 (10wiki_willy) T354685 looks like it was upgraded in January, but this task was created afterwards on March 25. @herron - do you still need this request d... [22:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:00] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:16:40] RESOLVED: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:56] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:20:02] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:20:57] well that explains it not loading [22:21:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.233 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:21:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:21:54] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:22:55] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:39] jouncebot: next [22:27:40] In 7 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T0600) [22:27:45] jouncebot: refresh [22:27:45] I refreshed my knowledge about deployments. 
[22:27:47] jouncebot: next [22:27:48] In 0 hour(s) and 32 minute(s): bd808's backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T2300) [22:27:55] RESOLVED: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2065:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:28:01] VIP service [22:28:52] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9869436 (10Dzahn) [22:29:09] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1032626 (owner: 10BCornwall) [22:29:47] dancy: :) I needed to test my fix for T366794 and conveniently I also need to deploy a couple of wikitech config changes [22:29:47] T366794: Did not update deployment calendar - https://phabricator.wikimedia.org/T366794 [22:30:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869444 (10Jclark-ctr) Installed cross connect link came up on port. cableid #5229 [22:31:30] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9869438 (10Dzahn) User has signed L3 and already has an NDA. We are blocked on approvals from one of: @odimitrijevic, @Milimetric, @WDoranWMF or @Ahoelzl for Anal... [22:32:58] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9869453 (10Dzahn) Thanks all! We are blocked on approval from one of: @odimitrijevic, @Milimetric, @WDoranWMF, @Ahoelzl for Analytics team approval. 
[22:33:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869454 (10RobH) [22:33:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Patch circiut CRT-008647 - https://phabricator.wikimedia.org/T366102#9869455 (10RobH) 05Open→03Resolved Looks good to me on this end, thank you! [22:34:15] thcipriani: I invented a new backport window today that will start at the top of the hour. Do you trust me to figure out how to actually do the needful for myself (2 wikitech config changes), or should I wait for a more recently practiced deployer? [22:34:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9869459 (10Dzahn) 05Open→03Stalled p:05Triage→03High [22:36:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9869461 (10Dzahn) 05Open→03In progress p:05Triage→03High [22:38:51] 06SRE, 10LDAP-Access-Requests: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138#9869497 (10Dzahn) @RickiJay-WMDE Do you still have the login issue? 
[22:39:01] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9869493 (10Dzahn) Yes, that seems correct per https://wikitech.wikimedia.org/wiki/Volunteer_NDA [22:40:10] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:55] RESOLVED: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:30] (03PS2) 10BryanDavis: wikitech: Update Phabricator Conduit calls to disable/enable users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) [22:43:30] (03PS3) 10BryanDavis: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [22:43:30] (03PS4) 10BryanDavis: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [22:48:57] 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9869512 (10Dzahn) Per T323208#8399531 and if going by the literal ticket title, I think this is resolved all this time. A change to an apache config... 
[22:50:22] 06SRE, 10Wikimedia-Mailing-lists: Fix lists.wmcloud.org - https://phabricator.wikimedia.org/T290110#9869524 (10Dzahn) While lists.wmcloud.org is still in DNS ( has address 185.15.56.43) it's "Unable to connect" in a browser. Also https://polymorphic.lists.wmcloud.org/ is not reachable nowadays. Do these thin... [22:56:10] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:56:35] (03PS1) 10Ncmonitor: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1039847 [22:56:39] (03PS1) 10Ncmonitor: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039848 [22:56:42] (03PS1) 10Ncmonitor: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039849 [22:57:02] (03CR) 10CI reject: [V:04-1] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [22:57:55] RESOLVED: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:00:04] bd808: bd808's backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240606T2300). Please do the needful. [23:00:04] bd808 and bd808: A patch you scheduled for bd808's backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:26] so many pings for bd808 there :) [23:00:43] * bd808 follows the script [23:00:44] I can do the deploys today! [23:02:14] bd808: are you ready? [23:02:19] bd808: yup! 
[23:02:32] let's start with 1039307 [23:03:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) (owner: 10BryanDavis) [23:03:51] (03Merged) 10jenkins-bot: wikitech: Update Phabricator Conduit calls to disable/enable users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039307 (https://phabricator.wikimedia.org/T366587) (owner: 10BryanDavis) [23:04:09] !log bd808@deploy1002 Started scap: Backport for [[gerrit:1039307|wikitech: Update Phabricator Conduit calls to disable/enable users (T366587)]] [23:04:13] T366587: Update Phabricator BlockIpComplete hook to use "user.edit" Conduit API - https://phabricator.wikimedia.org/T366587 [23:06:31] !log bd808@deploy1002 bd808: Backport for [[gerrit:1039307|wikitech: Update Phabricator Conduit calls to disable/enable users (T366587)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:07:17] This is a wikitech-only change and we can't test wikitech from the testservers yet, so continuing with the sync [23:07:23] !log bd808@deploy1002 bd808: Continuing with sync [23:08:51] (03PS1) 10DErenrich: Add citation-needed-api to toolforge's prometheus [puppet] - 10https://gerrit.wikimedia.org/r/1039850 (https://phabricator.wikimedia.org/T363371) [23:10:36] This is all so much fancier on the cli than in the olden times. 
The logspam-watch script is especially cute [23:15:24] (03CR) 10DErenrich: "Adding t.aavi as a reviewer per Bryan Davis on #wikimedia-cloud" [puppet] - 10https://gerrit.wikimedia.org/r/1039850 (https://phabricator.wikimedia.org/T363371) (owner: 10DErenrich) [23:16:10] !log bd808@deploy1002 Finished scap: Backport for [[gerrit:1039307|wikitech: Update Phabricator Conduit calls to disable/enable users (T366587)]] (duration: 12m 01s) [23:16:14] T366587: Update Phabricator BlockIpComplete hook to use "user.edit" Conduit API - https://phabricator.wikimedia.org/T366587 [23:16:33] ok, now I can actually test it on wikitech... [23:18:33] that one worked :) [23:18:44] bd808: ready for the second patch? [23:18:47] bd808: yup! [23:19:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [23:20:29] (03Merged) 10jenkins-bot: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: 10Majavah) [23:20:47] !log bd808@deploy1002 Started scap: Backport for [[gerrit:1038749|wikitech: Replace OSM class in Gerrit blocking hook (T161553)]] [23:20:50] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [23:22:10] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:07] !log bd808@deploy1002 taavi and bd808: Backport for [[gerrit:1038749|wikitech: Replace OSM class in Gerrit blocking hook (T161553)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:23:27] same deal. this can't be tested until it's live on wikitech. 
[23:23:29] !log bd808@deploy1002 taavi and bd808: Continuing with sync [23:23:55] RESOLVED: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:25:50] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:31:02] RECOVERY - Host mr1-eqsin.oob is UP: PING WARNING - Packet loss = 90%, RTA = 347.33 ms [23:32:11] !log bd808@deploy1002 Finished scap: Backport for [[gerrit:1038749|wikitech: Replace OSM class in Gerrit blocking hook (T161553)]] (duration: 11m 24s) [23:32:17] T161553: Remove OpenStackManager from Wikitech - https://phabricator.wikimedia.org/T161553 [23:33:03] time for another live test... [23:37:26] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:38:27] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1039592 [23:38:27] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1039592 (owner: 10TrainBranchBot) [23:39:04] I just want to apologize for scheduling the rate limit patch for the backport window today - I should have caught both that it failed CI and that the rationale made no sense for numerous reasons I've now written up at T363291, which I've closed as declined. Instead I created unnecessary work for others. 
[23:39:05] T363291: Mass Account Creation exception at idwiki and enwiki 2024-04-25 to 2024-08-25 - https://phabricator.wikimedia.org/T363291 [23:39:55] FIRING: [15x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:01] all done with my extra backport window [23:42:28] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 232.15 ms [23:43:10] FIRING: [15x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:10] RESOLVED: [9x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:51:30] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic2100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:53:10] FIRING: [18x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:28] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on elastic2083 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:54:55] FIRING: [18x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2080:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:56:14] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on elastic2081 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state