[00:11:57] (CR) Andrea Denisse: [C: +2] grafana: Ensure the grafana1002 hosts uses Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[00:17:47] (PS1) Jeena Huneidi: disable wmgUsePageViewInfo and wmgUseIPInfo [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/991435
[00:21:05] (CR) Jeena Huneidi: [C: +2] disable wmgUsePageViewInfo and wmgUseIPInfo [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/991435 (owner: Jeena Huneidi)
[00:22:09] (Merged) jenkins-bot: disable wmgUsePageViewInfo and wmgUseIPInfo [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/991435 (owner: Jeena Huneidi)
[00:22:52] (PS1) Dzahn: phabricator: temp test of repo syncing, using gitlab2003 spare host [puppet] - https://gerrit.wikimedia.org/r/991439 (https://phabricator.wikimedia.org/T334519)
[00:23:38] (PS2) Dzahn: phabricator: temp test of repo syncing, using gitlab2003 spare host [puppet] - https://gerrit.wikimedia.org/r/991439 (https://phabricator.wikimedia.org/T334519)
[00:38:58] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991449
[00:39:04] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991449 (owner: TrainBranchBot)
[01:16:06] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991449 (owner: TrainBranchBot)
[01:21:37] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Kappakayala) Hi @jeena , looking at the comments looks like this is not related to mw-on-k8s migration as I see @Clement_Goubert reverted to bare metal...
[02:32:08] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jeena) Yes, I think that is the correct assessment, so we still need to figure out how to solve this issue.
[02:39:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:23:43] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:21:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:26] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) Do you have a trace of how these calls occur? In theory I think this shouldn't be happening because thumbnailing requests should be directed to Th...
[04:55:37] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) What's supposed to happen is that the ThumbnailRender job makes an HTTP request to `http://ms-fe.svc.codfw.wmnet/wikipedia/commons/thumb/4/4f/Ambr...
[05:08:35] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 707.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:15:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:25:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P54862 and previous config saved to /var/cache/conftool/dbconfig/20240118-052556-ladsgroup.json
[05:26:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:27:04] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) Also there are ~100 errors per minute. ThumbnailRender tries to create four thumbnails per upload. There are usually 5-10 uploads per minute on Co...
[05:41:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P54863 and previous config saved to /var/cache/conftool/dbconfig/20240118-054103-ladsgroup.json
[05:48:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[05:48:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[05:52:41] SRE, ops-eqiad, DC-Ops: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui)
[05:52:53] SRE, ops-eqiad, DC-Ops: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui)
[05:52:58] (PS1) Marostegui: mariadb: Provision es10[35-40] [puppet] - https://gerrit.wikimedia.org/r/991468 (https://phabricator.wikimedia.org/T355269)
[05:52:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[05:53:12] (Abandoned) Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: Samtar)
[05:53:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance
[05:53:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[05:54:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance
[05:54:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2104 (T354336)', diff saved to https://phabricator.wikimedia.org/P54864 and previous config saved to /var/cache/conftool/dbconfig/20240118-055419-marostegui.json
[05:54:23] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[05:56:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P54865 and previous config saved to /var/cache/conftool/dbconfig/20240118-055609-ladsgroup.json
[05:56:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T354336)', diff saved to https://phabricator.wikimedia.org/P54866 and previous config saved to /var/cache/conftool/dbconfig/20240118-055643-marostegui.json
[05:58:06] (CR) Marostegui: [C: +2] mariadb: Provision es10[35-40] [puppet] - https://gerrit.wikimedia.org/r/991468 (https://phabricator.wikimedia.org/T355269) (owner: Marostegui)
[05:59:41] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 10.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:06:13] (PS1) Marostegui: site.pp: Add es10[35-40] [puppet] - https://gerrit.wikimedia.org/r/991470 (https://phabricator.wikimedia.org/T355269)
[06:08:25] (CR) Marostegui: [C: +2] site.pp: Add es10[35-40] [puppet] - https://gerrit.wikimedia.org/r/991470 (https://phabricator.wikimedia.org/T355269) (owner: Marostegui)
[06:08:55] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui) >>! In T355269#9467405, @Jclark-ctr wrote: > If you can update installation instructions and update preseed.yaml, and site.pp if needed Thanks Done!
[06:10:50] SRE, ops-eqiad, DC-Ops, Patch-For-Review: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui) a:Marostegui→Jclark-ctr
[06:11:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P54867 and previous config saved to /var/cache/conftool/dbconfig/20240118-061116-ladsgroup.json
[06:11:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[06:11:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:11:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[06:11:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P54868 and previous config saved to /var/cache/conftool/dbconfig/20240118-061138-ladsgroup.json
[06:11:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54869 and previous config saved to /var/cache/conftool/dbconfig/20240118-061150-marostegui.json
[06:22:01] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:25:03] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:26:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54870 and previous config saved to /var/cache/conftool/dbconfig/20240118-062657-marostegui.json
[06:38:11] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui)
[06:40:05] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui)
[06:42:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T354336)', diff saved to https://phabricator.wikimedia.org/P54871 and previous config saved to /var/cache/conftool/dbconfig/20240118-064203-marostegui.json
[06:42:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:42:08] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:42:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:42:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T354336)', diff saved to https://phabricator.wikimedia.org/P54872 and previous config saved to /var/cache/conftool/dbconfig/20240118-064225-marostegui.json
[06:43:17] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (Marostegui)
[06:44:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T354336)', diff saved to https://phabricator.wikimedia.org/P54873 and previous config saved to /var/cache/conftool/dbconfig/20240118-064456-marostegui.json
[06:52:59] SRE, ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (ayounsi) @RobH I could have been more clear :) We didn't renew support on that item, but yet it shows as having "core support", as Juniper already made mistake in the past I was wondering if it could be possible to check...
[07:00:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54874 and previous config saved to /var/cache/conftool/dbconfig/20240118-070003-marostegui.json
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T0700)
[07:00:04] kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T0700). nyaa~
[07:10:23] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:10:40] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) If {T351400} is the cause, then I am unsure if this is an unbreak now, as that code has been running since January 5 (see https://grafana.wiki...
[07:14:30] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) No that is NOT the cause. The problem is also happening on jobrunners, I don't think that script actually spawns jobs. I think the root cause is t...
[07:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54875 and previous config saved to /var/cache/conftool/dbconfig/20240118-071509-marostegui.json
[07:23:44] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:24:54] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) It was run with a `--use-jobqueue` parameter, that's pretty indicative of spawning jobs.
[07:25:21] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) >>! In T355243#9467881, @kostajh wrote: > If {T351400} is the cause, then I am unsure if this is an unbreak now, as that code has been running sin...
[07:28:40] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9467899, @Joe wrote: >>>! In T355243#9467881, @kostajh wrote: >> If {T351400} is the cause, then I am unsure if this is an unbr...
[07:30:13] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:30:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T354336)', diff saved to https://phabricator.wikimedia.org/P54876 and previous config saved to /var/cache/conftool/dbconfig/20240118-073016-marostegui.json
[07:30:18] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) @Joe do you want us to stop the script for now, and switch to not using the job queue?
[07:30:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:30:22] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:30:23] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:30:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:30:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:30:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T354336)', diff saved to https://phabricator.wikimedia.org/P54877 and previous config saved to /var/cache/conftool/dbconfig/20240118-073054-marostegui.json
[07:31:27] RECOVERY - BFD status on cr1-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:32:25] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) >>! In T355243#9467924, @kostajh wrote: > @Joe do you want us to stop the script for now, and switch to not using the job queue? I mean, right no...
[07:33:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T354336)', diff saved to https://phabricator.wikimedia.org/P54878 and previous config saved to /var/cache/conftool/dbconfig/20240118-073319-marostegui.json
[07:35:15] (PS2) Hubaishan: Restrict pagequality-validate right to patroller in arwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/991379 (https://phabricator.wikimedia.org/T354503)
[07:36:15] SRE, SRE-Access-Requests, User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (ArthurTaylor) Thanks @Dzahn . So how do I get shell access with restricted group? What are the next steps here?
[07:38:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:39:21] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9467936, @Joe wrote: >>>! In T355243#9467924, @kostajh wrote: >> @Joe do you want us to stop the script for now, and switch to...
[07:43:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:48:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54879 and previous config saved to /var/cache/conftool/dbconfig/20240118-074825-marostegui.json
[07:59:29] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T0800).
[08:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:50] o/
[08:03:25] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp, or tiff) could be provided ... response ti...
[08:03:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54880 and previous config saved to /var/cache/conftool/dbconfig/20240118-080332-marostegui.json
[08:07:01] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468007, @Tgr wrote: > The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp,...
[08:09:57] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:18:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T354336)', diff saved to https://phabricator.wikimedia.org/P54881 and previous config saved to /var/cache/conftool/dbconfig/20240118-081838-marostegui.json
[08:18:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[08:18:44] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[08:18:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance
[08:19:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54882 and previous config saved to /var/cache/conftool/dbconfig/20240118-081900-marostegui.json
[08:20:20] SRE, Ganeti, Infrastructure-Foundations, netops, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) The above patches should get us as far as DHCP. DHCP is going to be the next big challenge to solve, partly because of the setback of Opti...
[08:20:49] (CR) Muehlenhoff: [C: +1] "Looks good. One suggestion inline, but feel free to ignore, it's just a cleanup." [puppet] - https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) (owner: Filippo Giunchedi)
[08:21:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54883 and previous config saved to /var/cache/conftool/dbconfig/20240118-082130-marostegui.json
[08:24:45] (CR) Muehlenhoff: "That needs to be reverted, grafana hosts are still on Buster and can only switch to Puppet 7 when they are on Bookworm. In fact, Puppet is" [puppet] - https://gerrit.wikimedia.org/r/991391 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[08:27:29] SRE, SRE-Access-Requests, User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (ItamarWMDE) @thcipriani @MoritzMuehlenhoff @DZahn, In the same way I and @HasanAkgun_WMDE needed `restricted` access, so does Arthur. As a senior engineer in our...
[08:28:45] SRE, ops-eqiad, DBA, DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (Marostegui) Any update from Dell?
[08:31:36] (PS1) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527
[08:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54884 and previous config saved to /var/cache/conftool/dbconfig/20240118-083636-marostegui.json
[08:37:12] (PS2) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T353993)
[08:37:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:39:51] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) Thank you everyone for jumping on this. It's not clear to me at this point if this is train-related after all. Should this ticket still be con...
[08:42:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:46:48] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468080, @jnuche wrote: > Thank you everyone for jumping on this. > > It's not clear to me at this point if this is train-rela...
[08:50:39] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468085, @kostajh wrote: >>>! In T355243#9468080, @jnuche wrote: >> Thank you everyone for jumping on this. >> >> It's not cle...
[08:51:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54885 and previous config saved to /var/cache/conftool/dbconfig/20240118-085143-marostegui.json [09:00:05] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T0900). [09:00:17] hi, still waiting on the outcome of https://phabricator.wikimedia.org/T355243 before making a decision about rolling the train [09:02:31] (03PS15) 10Majavah: P:toolforge::mailrelay: reject mail not using Toolforge domains [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) [09:02:33] (03PS15) 10Majavah: P:toolforge::mailrelay: add List-Id header for tool mail [puppet] - 10https://gerrit.wikimedia.org/r/971893 [09:03:39] (03CR) 10Ayounsi: "Overall lgtm, some comments inline." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) (owner: 10Cathal Mooney) [09:06:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54886 and previous config saved to /var/cache/conftool/dbconfig/20240118-090649-marostegui.json [09:06:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:06:55] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:07:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:07:08] (03CR) 10Majavah: [C: 03+2] P:toolforge::mailrelay: reject mail not using Toolforge domains (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971892 (https://phabricator.wikimedia.org/T354112) (owner: 10Majavah) [09:07:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T354336)', diff saved to https://phabricator.wikimedia.org/P54887 and previous config saved to /var/cache/conftool/dbconfig/20240118-090712-marostegui.json [09:09:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T354336)', diff saved to https://phabricator.wikimedia.org/P54888 and previous config saved to /var/cache/conftool/dbconfig/20240118-090941-marostegui.json [09:10:18] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1046 [puppet] - 10https://gerrit.wikimedia.org/r/990994 (owner: 10Effie Mouzeli) [09:12:43] !log stopped MediaModeration scanning script [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:46] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 
(10Dreamy_Jazz) [09:15:24] !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --sleep 0 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-0-non-jobqueue.txt` [09:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:28] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Dreamy_Jazz) I've stopped the script running now and have removed {T354432} as a parent task. [09:15:29] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [09:20:45] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10jnuche) @kostajh @Dreamy_Jazz thank you, I can see the error rate going down. I'm going to proceed with the train. [09:22:49] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991535 (https://phabricator.wikimedia.org/T354432) [09:22:51] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991535 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [09:23:50] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991535 (https://phabricator.wikimedia.org/T354432) (owner: 10TrainBranchBot) [09:24:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54889 and previous config saved to /var/cache/conftool/dbconfig/20240118-092447-marostegui.json [09:25:41] (03CR) 10Huei Tan: "LGTM, +1 when merge conflict is fixed!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/989606 (https://phabricator.wikimedia.org/T352454) (owner: 10Sbisson) [09:25:58] !log add 50G to prometheus@k8s-mlserve in codfw [09:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:03] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1046.eqiad.wmnet [09:28:38] (03CR) 10Ayounsi: Validators: enforce Trident3 port block consistency (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529) (owner: 10Ayounsi) [09:29:20] (03PS1) 10Filippo Giunchedi: Revert "grafana: Ensure the grafana1002 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/991546 [09:30:40] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "grafana: Ensure the grafana1002 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/991546 (owner: 10Filippo Giunchedi) [09:30:48] !log jnuche@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.14 refs T354432 [09:30:52] T354432: 1.42.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T354432 [09:35:35] (03CR) 10Muehlenhoff: [C: 03+2] Switch Mediawiki main memcache clusters to puppet 7: mc1046 [puppet] - 10https://gerrit.wikimedia.org/r/990994 (owner: 10Effie Mouzeli) [09:38:15] (03PS1) 10Btullis: Add a postgresql database for testing superset_next [puppet] - 10https://gerrit.wikimedia.org/r/991539 (https://phabricator.wikimedia.org/T335356) [09:38:21] (03PS1) 10Filippo Giunchedi: puppet: fail the run with puppet 7 and buster [puppet] - 10https://gerrit.wikimedia.org/r/991540 [09:39:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54890 and previous config saved to /var/cache/conftool/dbconfig/20240118-093954-marostegui.json [09:40:54] (03PS2) 10Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2046 [puppet] - 
https://gerrit.wikimedia.org/r/990995 (owner: Effie Mouzeli) [09:41:21] (CR) CI reject: [V: -1] puppet: fail the run with puppet 7 and buster [puppet] - https://gerrit.wikimedia.org/r/991540 (owner: Filippo Giunchedi) [09:42:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1046.eqiad.wmnet [09:42:55] (PS3) Hashar: Update the openjdk-11 images to match openjdk-8 [docker-images/production-images] - https://gerrit.wikimedia.org/r/990036 (owner: Btullis) [09:43:44] (CR) Hashar: [C: +1] "I have rebased to fix a conflicts with the weekly build bot which updates the `changelog` files." [docker-images/production-images] - https://gerrit.wikimedia.org/r/990036 (owner: Btullis) [09:44:17] (CR) Btullis: [C: +2] Update the openjdk-11 images to match openjdk-8 (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/990036 (owner: Btullis) [09:44:20] (CR) Btullis: [V: +2 C: +2] Update the openjdk-11 images to match openjdk-8 [docker-images/production-images] - https://gerrit.wikimedia.org/r/990036 (owner: Btullis) [09:46:13] (CR) Btullis: [C: +2] Add a postgresql database for testing superset_next [puppet] - https://gerrit.wikimedia.org/r/991539 (https://phabricator.wikimedia.org/T335356) (owner: Btullis) [09:49:15] (CR) Filippo Giunchedi: "I had a suspicion this wouldn't be quite so simple, looks like CI workers run rspec on buster:" [puppet] - https://gerrit.wikimedia.org/r/991540 (owner: Filippo Giunchedi) [09:55:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T354336)', diff saved to https://phabricator.wikimedia.org/P54891 and previous config saved to /var/cache/conftool/dbconfig/20240118-095500-marostegui.json [09:55:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:55:05] T354336: Add columns cul_result_id and 
cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [09:55:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:55:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54892 and previous config saved to /var/cache/conftool/dbconfig/20240118-095522-marostegui.json [09:57:21] (PS1) Filippo Giunchedi: grafana: temp disable rsync stunnel for puppet7 migration [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) [09:57:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54893 and previous config saved to /var/cache/conftool/dbconfig/20240118-095753-marostegui.json [09:58:25] btullis: you will have to build the refreshed openjdk-11 Docker images, I don't have the access/credentials to do it [09:59:43] (PS2) Msz2001: Promote wikimaniawiki to Vector 2022 as default skin [mediawiki-config] - https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) [10:00:26] hashar: Thank you. I'm already doing so. I didn't think to log it here though. 
[10:00:34] \o/ [10:01:32] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/output/991542/1152/" [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [10:01:53] !log built and published updated openjdk-11 images based on: 11.0.21-s0-20240111 [10:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2046.codfw.wmnet [10:06:49] (CR) Muehlenhoff: [C: +2] Switch Mediawiki main memcache clusters to puppet 7: mc2046 [puppet] - https://gerrit.wikimedia.org/r/990995 (owner: Effie Mouzeli) [10:07:09] (PS2) Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1047 [puppet] - https://gerrit.wikimedia.org/r/990996 (owner: Effie Mouzeli) [10:08:29] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [10:09:10] !log T351400 running on a tmux session `foreachwikiindblist group2.dblist extensions/MediaModeration/maintenance/scanFilesInScanTable.php --sleep 0 --verbose 2>&1 | tee ~/scan-files-in-scan-table-group2-sleep-0-non-jobqueue.txt` [10:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:14] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [10:10:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2046.codfw.wmnet [10:11:25] (CR) Filippo Giunchedi: [C: +2] grafana: temp disable rsync stunnel for puppet7 migration [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [10:13:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to 
https://phabricator.wikimedia.org/P54894 and previous config saved to /var/cache/conftool/dbconfig/20240118-101300-marostegui.json [10:13:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc1047.eqiad.wmnet [10:14:14] (CR) Muehlenhoff: [C: +2] Switch Mediawiki main memcache clusters to puppet 7: mc1047 [puppet] - https://gerrit.wikimedia.org/r/990996 (owner: Effie Mouzeli) [10:14:30] (PS2) Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc2047 [puppet] - https://gerrit.wikimedia.org/r/990997 (owner: Effie Mouzeli) [10:19:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc1047.eqiad.wmnet [10:20:25] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:08] (PS1) Muehlenhoff: cp: Remove obsolete Hiera entries for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991566 (https://phabricator.wikimedia.org/T349619) [10:22:32] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host mc2047.codfw.wmnet [10:22:55] (CR) Muehlenhoff: [C: +2] Switch Mediawiki main memcache clusters to puppet 7: mc2047 [puppet] - https://gerrit.wikimedia.org/r/990997 (owner: Effie Mouzeli) [10:25:07] (CR) Klausman: [V: +2 C: +2] Add Lift Wing recommendation-api-ng SLO [grafana-grizzly] - https://gerrit.wikimedia.org/r/989187 (https://phabricator.wikimedia.org/T347262) (owner: Klausman) [10:25:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2072.codfw.wmnet [10:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host mc2047.codfw.wmnet [10:28:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P54896 and previous config saved to 
/var/cache/conftool/dbconfig/20240118-102806-marostegui.json [10:29:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2072.codfw.wmnet [10:30:19] (PS2) Muehlenhoff: Switch Mediawiki main memcache clusters to puppet 7: mc1048 [puppet] - https://gerrit.wikimedia.org/r/990998 (owner: Effie Mouzeli) [10:30:25] PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:11] (PS2) Cathal Mooney: Use vlan name to determine if server BGP peering should be added [software/homer/deploy] - https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) [10:31:48] (CR) Cathal Mooney: "Thanks for the feedback, good shout on the potential missing "connected endpoint"." [software/homer/deploy] - https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) (owner: Cathal Mooney) [10:31:55] RECOVERY - Check systemd state on ms-be2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:15] !log hashar@deploy2002 Started deploy [integration/docroot@88f6458]: Add npm package link for Codex Design Tokens - T354310 [10:32:19] T354310: Sunset WikimediaUI Base - https://phabricator.wikimedia.org/T354310 [10:32:21] SRE, Machine-Learning-Team, Patch-For-Review: Requesting write access to ml-staging-codfw for ML team - https://phabricator.wikimedia.org/T354516 (klausman) p:Triage→High [10:32:22] !log hashar@deploy2002 Finished deploy [integration/docroot@88f6458]: Add npm package link for Codex Design Tokens - T354310 (duration: 00m 07s) [10:36:56] !log hashar@deploy2002 Started deploy [integration/docroot@8f5aa9e]: Add Codex Icons package [10:37:02] !log hashar@deploy2002 Finished deploy [integration/docroot@8f5aa9e]: Add Codex Icons package (duration: 00m 05s) 
[10:42:21] RECOVERY - Disk space on ms-be2072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2072&var-datasource=codfw+prometheus/ops [10:43:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T354336)', diff saved to https://phabricator.wikimedia.org/P54898 and previous config saved to /var/cache/conftool/dbconfig/20240118-104313-marostegui.json [10:43:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [10:43:18] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [10:43:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [10:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T354336)', diff saved to https://phabricator.wikimedia.org/P54899 and previous config saved to /var/cache/conftool/dbconfig/20240118-104335-marostegui.json [10:49:27] (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/991452 [10:50:35] (CR) Ayounsi: [C: +1] "lgtm!" [software/homer/deploy] - https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) (owner: Cathal Mooney) [10:54:22] (CR) Cathal Mooney: [C: +2] Use vlan name to determine if server BGP peering should be added [software/homer/deploy] - https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) (owner: Cathal Mooney) [10:57:18] (PS1) Filippo Giunchedi: grafana: chown rsync'd files [puppet] - https://gerrit.wikimedia.org/r/991569 (https://phabricator.wikimedia.org/T352665) [11:00:04] mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1100). [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1100) [11:00:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T354336)', diff saved to https://phabricator.wikimedia.org/P54900 and previous config saved to /var/cache/conftool/dbconfig/20240118-110009-marostegui.json [11:00:29] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:01:41] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002 [11:03:59] (PS1) Jelto: prometheus::blackbox::check: make for parameter configurable [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) [11:04:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002 [11:04:26] (PS2) Hnowlan: mobileapps: add Cassandra config support [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) [11:05:51] (CR) Hnowlan: mobileapps: add Cassandra config support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan) [11:06:29] (CR) Cathal Mooney: [C: +2] Use vlan name to determine if server BGP peering should be added (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/991356 (https://phabricator.wikimedia.org/T355225) (owner: Cathal Mooney) [11:08:01] (PS5) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) 
[11:09:42] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1153/" [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: Jelto) [11:11:10] (CR) Muehlenhoff: [C: +1] "Good idea!" [puppet] - https://gerrit.wikimedia.org/r/991540 (owner: Filippo Giunchedi) [11:11:27] (CR) Jgiannelos: mobileapps: add Cassandra config support (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991032 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan) [11:12:30] SRE, SRE-Access-Requests, User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (MoritzMuehlenhoff) @ItamarWMDE : Ack, thanks for the context. @thcipriani Does that help to approve the request? [11:12:54] SRE, SRE-Access-Requests, User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (MoritzMuehlenhoff) >>! In T354049#9467951, @ArthurTaylor wrote: > Thanks @Dzahn . So how do I get shell access with restricted group? What are the next steps her... 
[11:13:53] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/991569 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [11:14:19] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:15:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P54901 and previous config saved to /var/cache/conftool/dbconfig/20240118-111516-marostegui.json [11:17:03] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951079 (owner: Muehlenhoff) [11:21:02] !log bounce apache2 on logstash1025 / logstash1031 - T337818 [11:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:08] T337818: apache2 cpu-stuck on logstash1032 causes kafka logging lag - https://phabricator.wikimedia.org/T337818 [11:23:44] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:28:15] (CR) Filippo Giunchedi: [C: +2] grafana: chown rsync'd files [puppet] - https://gerrit.wikimedia.org/r/991569 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [11:29:16] SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan) [11:29:25] (PS9) EoghanGaffney: [gerrit] Refactor classes to specify an active host [puppet] - https://gerrit.wikimedia.org/r/988464 [11:30:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved 
to https://phabricator.wikimedia.org/P54902 and previous config saved to /var/cache/conftool/dbconfig/20240118-113022-marostegui.json [11:33:56] (CR) Muehlenhoff: [C: +2] dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/951079 (owner: Muehlenhoff) [11:35:28] (PS6) Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [11:37:03] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1154/console" [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [11:38:04] (PS6) Cathal Mooney: [WIP] Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [11:38:06] (CR) Cathal Mooney: [C: +1] "Overall LGTM. No strong preference on moving firewall::services and "neighbors_list" definitions, but it probably does make sense. 
I thi" [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [11:38:53] SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan) [11:42:22] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1155/console" [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [11:44:43] (PS3) Muehlenhoff: acme_chief: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970786 [11:45:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T354336)', diff saved to https://phabricator.wikimedia.org/P54903 and previous config saved to /var/cache/conftool/dbconfig/20240118-114528-marostegui.json [11:45:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2189.codfw.wmnet with reason: Maintenance [11:45:42] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [11:45:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2189.codfw.wmnet with reason: Maintenance [11:45:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T354336)', diff saved to https://phabricator.wikimedia.org/P54904 and previous config saved to /var/cache/conftool/dbconfig/20240118-114551-marostegui.json [11:47:17] (PS3) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. 
[dns] - https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) [11:48:15] (PS1) Filippo Giunchedi: grafana: deploy puppet dashboards as grafana/grafana [puppet] - https://gerrit.wikimedia.org/r/991573 (https://phabricator.wikimedia.org/T352665) [11:48:26] (PS1) Muehlenhoff: Remove now obsolete Hiera host entries [puppet] - https://gerrit.wikimedia.org/r/991574 [11:48:46] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/970786 (owner: Muehlenhoff) [11:48:54] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) >>! In T355243#9468007, @Tgr wrote: > The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, b... [11:50:12] (Abandoned) WMDE-Fisch: Fix state bleeding from one into the next [extensions/Kartographer] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991343 (https://phabricator.wikimedia.org/T355044) (owner: WMDE-Fisch) [11:51:05] (CR) Vgutierrez: [C: +1] "looking good, varnish tests are happy, could you please amend the commit message before merging? 
ping me on IRC and let's deploy this :D" [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [11:52:32] (CR) Btullis: [C: +1] webrequest varnishkafka - Add to X-Analytics prefetch indicators [puppet] - https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: Ottomata) [11:54:16] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) [11:54:24] (CR) Vgutierrez: "looking good, please see inline comments" [dns] - https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) (owner: SCherukuwada) [11:54:26] (CR) Filippo Giunchedi: "Idea and implementation LGTM, see comments inline and otherwise +1" [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: Jelto) [11:57:29] (PS1) Hnowlan: kubernetes: make 3 eqiad jobrunners k8s workers [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) [11:58:37] (CR) Vgutierrez: [C: +1] "small typo on the commit msg, LGTM" [puppet] - https://gerrit.wikimedia.org/r/970786 (owner: Muehlenhoff) [11:58:39] (PS4) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527 [11:58:42] (PS1) Hnowlan: Revert "changeprop-jobqueue: disable ThumbnailRender on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/991550 [11:59:42] (PS5) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527 [12:00:44] (CR) SCherukuwada: Add Google's TXT Verification entry to www for wikifunctions.org. 
(1 comment) [dns] - https://gerrit.wikimedia.org/r/991527 (owner: SCherukuwada) [12:01:26] (PS6) SCherukuwada: Add Google's TXT Verification entry for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527 [12:03:22] (PS1) Dreamy Jazz: Remove RENDER_NOW from File::transform call to avoid job thumbnailing [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991551 (https://phabricator.wikimedia.org/T355309) [12:03:36] (CR) Alexandros Kosiaris: [C: -1] cache.mcrouter: upgrade to 1.3.0 (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli) [12:04:01] (CR) Alexandros Kosiaris: [C: -1] cache.mcrouter: upgrade to 1.3.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli) [12:04:25] (CR) Alexandros Kosiaris: [C: -1] cache.mcrouter: upgrade to 1.3.0 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: Effie Mouzeli) [12:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P54905 and previous config saved to /var/cache/conftool/dbconfig/20240118-120528-ladsgroup.json [12:05:38] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:05:43] (PS2) Dreamy Jazz: SECURITY: Use message label instead of sanitized text output for massmessage-form-page-help message [extensions/MassMessage] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991552 (https://phabricator.wikimedia.org/T347742) [12:07:52] !log Doing security deploy for T347742 [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:57] T347742: CVE-2024-23176: MassMessage i18n key massmessage-form-page-help allows 
i18n-xss - https://phabricator.wikimedia.org/T347742 [12:08:17] (PS1) KartikMistry: Update MinT to 2024-01-18-051410-production [deployment-charts] - https://gerrit.wikimedia.org/r/991578 (https://phabricator.wikimedia.org/T338608) [12:08:50] (CR) TrainBranchBot: [C: +2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MassMessage] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991552 (https://phabricator.wikimedia.org/T347742) (owner: Dreamy Jazz) [12:09:10] (PS2) KartikMistry: Set MT threshold for Punjabi Wikipedia to 97 [mediawiki-config] - https://gerrit.wikimedia.org/r/991002 (https://phabricator.wikimedia.org/T347789) [12:09:45] (CR) Clément Goubert: [C: +1] Revert "changeprop-jobqueue: disable ThumbnailRender on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/991550 (owner: Hnowlan) [12:10:00] (PS4) Muehlenhoff: acme_chief: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970786 [12:10:10] (PS2) Jelto: prometheus::blackbox::check: make for parameter configurable [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) [12:11:04] (CR) Hnowlan: [C: +2] Revert "changeprop-jobqueue: disable ThumbnailRender on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/991550 (owner: Hnowlan) [12:11:25] (CR) Jelto: prometheus::blackbox::check: make for parameter configurable (4 comments) [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: Jelto) [12:12:05] (Merged) jenkins-bot: Revert "changeprop-jobqueue: disable ThumbnailRender on k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/991550 (owner: Hnowlan) [12:12:19] (CR) Kamila Součková: kubernetes: make 3 eqiad jobrunners k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [12:13:12] (CR) Filippo 
Giunchedi: [C: +1] "LGTM, nice!" [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: Jelto) [12:13:16] (CR) Vgutierrez: "please don't drop the Bug: line from the commit message :)" [dns] - https://gerrit.wikimedia.org/r/991527 (owner: SCherukuwada) [12:14:22] (CR) Clément Goubert: [C: +1] kubernetes: make 3 eqiad jobrunners k8s workers [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [12:15:17] (Merged) jenkins-bot: SECURITY: Use message label instead of sanitized text output for massmessage-form-page-help message [extensions/MassMessage] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991552 (https://phabricator.wikimedia.org/T347742) (owner: Dreamy Jazz) [12:15:37] SRE, LDAP-Access-Requests: Grant Access to wmde, nda for Dima Koushha - https://phabricator.wikimedia.org/T354276 (Dima) Hi @Dzahn, Great, thank you! [12:15:41] !log jynus@cumin1002 dbctl commit (dc=all): 'Depool db2146', diff saved to https://phabricator.wikimedia.org/P54906 and previous config saved to /var/cache/conftool/dbconfig/20240118-121541-jynus.json [12:16:06] !log depooled db2146, lot of lag, should be investigated later [12:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:16] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1157/" [puppet] - https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: Jelto) [12:16:27] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:16:53] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:17:00] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:17:37] !log hnowlan@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:18:45] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:991552|SECURITY: Use message label instead of sanitized text output for massmessage-form-page-help message (T347742)]] [12:18:49] T347742: CVE-2024-23176: MassMessage i18n key massmessage-form-page-help allows i18n-xss - https://phabricator.wikimedia.org/T347742 [12:18:54] I've downtimed db2146, but maybe it could page [12:19:13] ^ godog hnowlan [12:19:17] ack, thanks [12:19:25] it is a real issue [12:19:29] thank you for the heads up jynus [12:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T354336)', diff saved to https://phabricator.wikimedia.org/P54907 and previous config saved to /var/cache/conftool/dbconfig/20240118-121932-marostegui.json [12:19:34] but I depooled it based on Amir feedback [12:19:37] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [12:19:54] keep an eye on it and call him or arnaud if something weird keeps happening [12:20:05] I have no idea why that happened (or maybe it is maintenance) [12:20:21] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [12:20:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [12:20:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P54908 and previous config saved to /var/cache/conftool/dbconfig/20240118-122035-ladsgroup.json [12:20:37] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:991552|SECURITY: Use message label instead of sanitized text output for massmessage-form-page-help message (T347742)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:21:14] (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/991585 
(https://phabricator.wikimedia.org/T355246) [12:21:19] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:21:43] so to be investigated, but it should not impact production at the time [12:22:58] (PS2) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/991585 (https://phabricator.wikimedia.org/T355246) [12:23:05] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/991453 [12:23:11] (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/991585 (https://phabricator.wikimedia.org/T355246) (owner: STran) [12:23:49] (CR) Muehlenhoff: [C: +2] acme_chief: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/970786 (owner: Muehlenhoff) [12:24:08] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/991585 (https://phabricator.wikimedia.org/T355246) (owner: STran) [12:24:09] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [12:26:36] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:27:06] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [12:27:13] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:991552|SECURITY: Use message label instead of sanitized text output for massmessage-form-page-help message (T347742)]] (duration: 08m 28s) [12:27:19] T347742: CVE-2024-23176: MassMessage i18n key massmessage-form-page-help allows i18n-xss - https://phabricator.wikimedia.org/T347742 [12:27:30] !log Finished security deploy for T347742 [12:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:56] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:31:46] !log stran@deploy2002 helmfile [staging] DONE 
helmfile.d/services/ipoid: apply [12:32:13] (PS7) SCherukuwada: Add Google's TXT Verification entry for wikifunctions.org. [dns] - https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) [12:32:48] (CR) SCherukuwada: Add Google's TXT Verification entry for wikifunctions.org. (1 comment) [dns] - https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) (owner: SCherukuwada) [12:33:24] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [12:34:28] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [12:34:30] (PS2) Hnowlan: kubernetes: make 3 eqiad jobrunners k8s workers [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) [12:34:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54909 and previous config saved to /var/cache/conftool/dbconfig/20240118-123439-marostegui.json [12:34:51] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/991573 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [12:35:19] (CR) Hnowlan: kubernetes: make 3 eqiad jobrunners k8s workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [12:35:20] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [12:35:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P54910 and previous config saved to /var/cache/conftool/dbconfig/20240118-123541-ladsgroup.json [12:39:10] (CR) Filippo Giunchedi: [C: +2] grafana: deploy puppet dashboards as grafana/grafana [puppet] - https://gerrit.wikimedia.org/r/991573 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi) [12:41:13] !log grafana
restarted on grafana1002 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/991573 [12:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:26] (CR) EoghanGaffney: [V: +1] [gerrit] Refactor classes to specify an active host (1 comment) [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [12:42:28] (PS2) Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971449 [12:42:47] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [12:46:26] (PS4) Majavah: P:openstack::magnum: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977599 [12:46:28] SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (Clement_Goubert) Can you hold for hosts in codfw rows A and B for {T354869}? It's not a problem that hosts from these rows have already been changed over, we will just hav...
[12:46:34] (PS1) Majavah: P:openstack::heat: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991591 [12:46:38] (PS1) Majavah: P:openstack::cinder: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991592 [12:46:41] (Abandoned) Slyngshede: logstash: add squid cache ECS filters and tests [puppet] - https://gerrit.wikimedia.org/r/902664 (owner: Slyngshede) [12:46:58] (CR) EoghanGaffney: [V: +1 C: +2] [gerrit] Refactor classes to specify an active host [puppet] - https://gerrit.wikimedia.org/r/988464 (owner: EoghanGaffney) [12:48:50] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (MoritzMuehlenhoff) @Marostegui @ABran-WMF With https://gerrit.wikimedia.org/r/c/operations/puppet/+/991082/ deployed, these are good to rema... [12:49:22] (Abandoned) Slyngshede: Package for Debian [software/debmonitor] - https://gerrit.wikimedia.org/r/980397 (owner: Slyngshede) [12:49:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54911 and previous config saved to /var/cache/conftool/dbconfig/20240118-124945-marostegui.json [12:50:25] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971449 (owner: Muehlenhoff) [12:50:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P54912 and previous config saved to /var/cache/conftool/dbconfig/20240118-125048-ladsgroup.json [12:50:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:50:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:51:06] !log ladsgroup@cumin1001 END (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [12:51:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:51:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P54913 and previous config saved to /var/cache/conftool/dbconfig/20240118-125130-ladsgroup.json [12:51:41] (CR) Majavah: [C: +2] P:openstack::magnum: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977599 (owner: Majavah) [12:52:37] (Abandoned) Majavah: kubernetes: Use modern weld api [software/tools-webservice] - https://gerrit.wikimedia.org/r/929009 (owner: Majavah) [12:52:59] (Abandoned) Majavah: Add an option to disable NFS access [software/tools-webservice] - https://gerrit.wikimedia.org/r/920259 (https://phabricator.wikimedia.org/T334081) (owner: Majavah) [12:53:37] (PS1) Muehlenhoff: Make ganeti1035 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/991593 (https://phabricator.wikimedia.org/T349925) [12:53:41] (Abandoned) Majavah: [WIP] Add html webservice type [software/tools-webservice] - https://gerrit.wikimedia.org/r/561753 (https://phabricator.wikimedia.org/T241817) (owner: Legoktm) [12:54:23] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [12:54:29] (CR) Muehlenhoff: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede) [12:56:04] (CR) Majavah: [C: +1] profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971449 (owner:
Muehlenhoff) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1300) [13:00:28] (CR) Kamila Součková: [C: +1] kubernetes: make 3 eqiad jobrunners k8s workers [puppet] - https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan) [13:00:51] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:45] (PS1) Majavah: Stop building buster based images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991594 (https://phabricator.wikimedia.org/T287900) [13:01:56] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: puppet7 on cumin breaks database connections - https://phabricator.wikimedia.org/T352974 (Marostegui) [13:02:20] (CR) Volans: [C: +1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede) [13:02:33] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 3 others: Revert dbstore migration from puppet7 to puppet5 - https://phabricator.wikimedia.org/T354411 (Marostegui) Stalled→Declined Good to decline! We can always reopen if needed. Thank you Ben for the help you've provided troubleshoo...
[13:04:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T354336)', diff saved to https://phabricator.wikimedia.org/P54914 and previous config saved to /var/cache/conftool/dbconfig/20240118-130451-marostegui.json [13:04:57] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:06:17] (CR) Muehlenhoff: [C: +2] Make ganeti1035 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/991593 (https://phabricator.wikimedia.org/T349925) (owner: Muehlenhoff) [13:11:27] (PS1) Majavah: Provide a standalone bookworm-web container [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/991595 (https://phabricator.wikimedia.org/T355231) [13:27:25] (PS7) EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) [13:28:50] !log installing python-requests security updates [13:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:31] SRE, Infrastructure-Foundations, netops: Verify and Configure ECMP operation for EVPN switches - https://phabricator.wikimedia.org/T334658 (cmooney) Open→Resolved Closing this. It's a global setting and as per the description we need to keep ports in play to get a load-balance for VXLAN traf...
[13:30:32] (CR) CI reject: [V: -1] [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [13:31:09] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1158/co" [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [13:31:27] PROBLEM - WDQS SPARQL on wdqs1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:32:11] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:32:13] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:32:28] (PS8) EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) [13:34:07] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:35:25] RECOVERY - WDQS SPARQL on wdqs1018 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:35:50] (CR) CI reject: [V: -1] [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [13:36:10] (PS9) EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) [13:36:11] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 690
bytes in 0.641 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [13:36:37] (CR) Cathal Mooney: Add BGP to the contributing protocols for aggregate routes on CRs (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/975070 (https://phabricator.wikimedia.org/T351456) (owner: Cathal Mooney) [13:38:20] SRE, Infrastructure-Foundations, netops: Create single Homer BGP group template to cover all variants - https://phabricator.wikimedia.org/T349116 (cmooney) [13:39:18] (CR) CI reject: [V: -1] [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [13:39:22] (PS10) EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) [13:42:10] SRE, Infrastructure-Foundations, netops: Firewall filter blocking traceroute in underlay QFX5120 EVPN - https://phabricator.wikimedia.org/T348120 (cmooney) >>! In T348120#9224531, @ayounsi wrote: > Nice rabbit hole! I found this: https://www.reddit.com/r/Juniper/comments/g12qxh/the_right_way_to_allow...
[13:42:31] (CR) CI reject: [V: -1] [gerrit] Add rsync job for lfs sync [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney) [13:46:47] SRE, Infrastructure-Foundations, netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (cmooney) [13:49:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 1%: T355313', diff saved to https://phabricator.wikimedia.org/P54915 and previous config saved to /var/cache/conftool/dbconfig/20240118-134936-root.json [13:49:41] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [13:51:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:51:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:53:13] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) [13:53:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:53:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance [13:54:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:54:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:54:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T354336)', diff saved to https://phabricator.wikimedia.org/P54916 and previous config saved to
/var/cache/conftool/dbconfig/20240118-135422-marostegui.json [13:54:27] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [13:56:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T354336)', diff saved to https://phabricator.wikimedia.org/P54917 and previous config saved to /var/cache/conftool/dbconfig/20240118-135633-marostegui.json [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1400). [14:00:05] Dreamy_Jazz, kart_, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] \o [14:00:19] (CR) Alexandros Kosiaris: [C: -1] mediawiki::cgroup: Enanble v1 cgroups on bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) (owner: Muehlenhoff) [14:00:20] o/ [14:00:21] I can deploy my one [14:01:11] (CR) TrainBranchBot: [C: +2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991551 (https://phabricator.wikimedia.org/T355309) (owner: Dreamy Jazz) [14:03:42] (Merged) jenkins-bot: Remove RENDER_NOW from File::transform call to avoid job thumbnailing [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - https://gerrit.wikimedia.org/r/991551 (https://phabricator.wikimedia.org/T355309) (owner: Dreamy Jazz) [14:03:58] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:991551|Remove RENDER_NOW from File::transform call to avoid job thumbnailing (T355309)]] [14:04:03] T355309: Don't use RENDER_NOW when generating thumbnails for PhotoDNA scans
- https://phabricator.wikimedia.org/T355309 [14:04:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 5%: T355313', diff saved to https://phabricator.wikimedia.org/P54918 and previous config saved to /var/cache/conftool/dbconfig/20240118-140441-root.json [14:04:46] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [14:05:15] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:991551|Remove RENDER_NOW from File::transform call to avoid job thumbnailing (T355309)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:10] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:06:30] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: aqs [14:07:07] !log stopped MediaModerations scan for group2 [14:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:58] !log Stopped MediaModeration scan for commonswiki [14:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:23] (PS1) Muehlenhoff: Switch aqs to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991600 (https://phabricator.wikimedia.org/T349619) [14:10:58] (CR) Muehlenhoff: [C: +2] Switch aqs to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991600 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [14:11:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P54919 and previous config saved to /var/cache/conftool/dbconfig/20240118-141139-marostegui.json [14:11:48] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:991551|Remove RENDER_NOW from File::transform call to avoid job thumbnailing (T355309)]] (duration: 07m 50s) [14:11:52] T355309: Don't use RENDER_NOW when generating thumbnails for PhotoDNA scans - https://phabricator.wikimedia.org/T355309 [14:12:38] !log running on a tmux
session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` [14:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] I'm testing my backport using that maintenance script. If things go wrong I will stop the script and may need to undo my backport in the worst case. [14:14:00] In the mean while, I think the other changes can continue. [14:15:43] Dreamy_Jazz: Should I go ahead with my change? [14:16:20] Sure. [14:16:27] It looks to be working. [14:16:41] that is my fix [14:16:54] (PS2) Majavah: P:openstack::heat: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991591 [14:16:56] (PS2) Majavah: P:openstack::cinder: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991592 [14:17:15] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/991002 (https://phabricator.wikimedia.org/T347789) (owner: KartikMistry) [14:18:01] PROBLEM - Disk space on ms-be2072 is CRITICAL: DISK CRITICAL - /srv/swift-storage/objects0 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2072&var-datasource=codfw+prometheus/ops [14:18:26] (Merged) jenkins-bot: Set MT threshold for Punjabi Wikipedia to 97 [mediawiki-config] - https://gerrit.wikimedia.org/r/991002 (https://phabricator.wikimedia.org/T347789) (owner: KartikMistry) [14:18:38] !log kartik@deploy2002 Started scap: Backport for [[gerrit:991002|Set MT threshold for Punjabi Wikipedia to 97 (T347789)]] [14:18:50] T347789: Limit Adjustment for Translate to Punjabi Pa by Google Translate from English - https://phabricator.wikimedia.org/T347789 [14:19:47] !log
marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: T355313', diff saved to https://phabricator.wikimedia.org/P54920 and previous config saved to /var/cache/conftool/dbconfig/20240118-141946-root.json [14:19:52] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [14:19:57] !log kartik@deploy2002 kartik: Backport for [[gerrit:991002|Set MT threshold for Punjabi Wikipedia to 97 (T347789)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:20:41] anzx: Just to check, do you have deployment rights? If not and no-one else is around, I could deploy your change. [14:21:16] Dreamy_Jazz: i don't have deployment rights, you can deploy [14:21:30] Sure. I'll wait until the previous config change is deployed. [14:21:37] Ok [14:21:53] From what I can tell you should be able to test this change? [14:22:00] yes [14:22:08] 👍 [14:22:56] !log kartik@deploy2002 kartik: Continuing with sync [14:24:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: aqs [14:24:55] PROBLEM - Check systemd state on ms-be2072 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:05] (CR) Majavah: [C: +2] P:openstack::heat: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991591 (owner: Majavah) [14:26:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P54921 and previous config saved to /var/cache/conftool/dbconfig/20240118-142646-marostegui.json [14:27:08] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [14:28:42] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:991002|Set MT threshold for Punjabi Wikipedia to 97
(T347789)]] (duration: 10m 03s) [14:28:46] T347789: Limit Adjustment for Translate to Punjabi Pa by Google Translate from English - https://phabricator.wikimedia.org/T347789 [14:29:07] Dreamy_Jazz: I'm done with config change. [14:29:14] Thanks! [14:29:17] (ProbeDown) firing: (2) Service debmonitor2003:443 has failed probes (http_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:29:28] (CR) TrainBranchBot: [C: +2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) (owner: Anzx) [14:29:45] (CR) Jelto: [V: +1 C: +2] miscweb/microsites: remove profile::microsites::design [puppet] - https://gerrit.wikimedia.org/r/991011 (https://phabricator.wikimedia.org/T350791) (owner: Jelto) [14:30:11] (Merged) jenkins-bot: thwiki: update tagline and optimise other logos [mediawiki-config] - https://gerrit.wikimedia.org/r/989750 (https://phabricator.wikimedia.org/T341407) (owner: Anzx) [14:30:27] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:989750|thwiki: update tagline and optimise other logos (T341407)]] [14:30:31] T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407 [14:31:50] !log dreamyjazz@deploy2002 anzx and dreamyjazz: Backport for [[gerrit:989750|thwiki: update tagline and optimise other logos (T341407)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:31:58] anzx: Please test [14:32:00] Dreamy_Jazz: checking [14:32:04] 👍 [14:32:56] Dreamy_Jazz: looks good [14:33:03] Thanks. Continuing.
[14:33:05] !log dreamyjazz@deploy2002 anzx and dreamyjazz: Continuing with sync [14:34:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: T355313', diff saved to https://phabricator.wikimedia.org/P54922 and previous config saved to /var/cache/conftool/dbconfig/20240118-143451-root.json [14:34:56] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [14:35:56] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 10 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [14:36:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [14:37:28] (CR) Clément Goubert: [C: +1] k8s topology labels: add row to rack transition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: Ayounsi) [14:38:55] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:989750|thwiki: update tagline and optimise other logos (T341407)]] (duration: 08m 28s) [14:38:59] Dreamy_Jazz: please run maintenance script https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Purging https://www.irccloud.com/pastebin/r52HNELx/ [14:39:08] T341407: Update th.wikipedia.org logo - https://phabricator.wikimedia.org/T341407 [14:39:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:35] Thanks for the commands. Doing that now. [14:39:41] (CR) Ssingh: [C: +1] "And thanks!"
[puppet] - https://gerrit.wikimedia.org/r/991566 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [14:41:46] !log Ran `echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-th.svg' | mwscript purgeList.php`, `echo 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-th.svg' | mwscript purgeList.php`, `echo 'https://en.wikipedia.org/static/images/project-logos/thwiki.png' | mwscript purgeList.php`, `echo 'https://en.wikipedia.org/static/images/project-logos/thwiki-1.5x.png' | [14:41:46] mwscript purgeList.php`, and `echo 'https://en.wikipedia.org/static/images/project-logos/thwiki-2x.png' | mwscript purgeList.php` [14:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T354336)', diff saved to https://phabricator.wikimedia.org/P54923 and previous config saved to /var/cache/conftool/dbconfig/20240118-144152-marostegui.json [14:41:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:41:57] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [14:42:01] (CR) Volans: "LGTM but I have few doubts inline" [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:42:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1170.eqiad.wmnet with reason: Maintenance [14:42:11] Dreamy_Jazz: thank you for deploying [14:42:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P54924 and previous config saved to /var/cache/conftool/dbconfig/20240118-144214-marostegui.json [14:42:23] !log disable puppet on ms-be2072 to try and deal with faulty drive [14:42:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P54925 and previous config saved to /var/cache/conftool/dbconfig/20240118-144228-marostegui.json [14:42:43] No problem [14:42:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:43:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:43:26] !log Afternoon UTC backport window done [14:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:44:42] (CR) Muehlenhoff: [C: +1] Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede) [14:45:08] (PS1) Phuedx: ext-EventLogging,ext-EventStreamConfig: Remove mediawiki.special_diff_interactions stream [mediawiki-config] - https://gerrit.wikimedia.org/r/991606 (https://phabricator.wikimedia.org/T353366) [14:46:58] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff) [14:47:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:57] (Abandoned) SCherukuwada: Add Yandex's TXT verification entry to www. [dns] - https://gerrit.wikimedia.org/r/768664 (https://phabricator.wikimedia.org/T302617) (owner: SCherukuwada) [14:49:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: T355313', diff saved to https://phabricator.wikimedia.org/P54926 and previous config saved to /var/cache/conftool/dbconfig/20240118-144956-root.json [14:50:01] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [14:51:00] (CR) Volans: [C: +1] "Looks ok to me" [cookbooks] - https://gerrit.wikimedia.org/r/989130 (https://phabricator.wikimedia.org/T297026) (owner: Majavah) [14:51:47] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) [14:53:04] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330 (MatthewVernon) [14:53:34] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330 (MatthewVernon) p:Triage→High [14:55:32] SRE, Wikimedia-Etherpad, collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (Jelto) a:Jelto [14:55:41] SRE, MW-on-K8s, serviceops, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) p:Unbreak!→Triage After I backported the patch in {T355309} and restarted the script with the job queue method, I no longer see th... [14:56:15] (CR) Hashar: "For sure!
My guess is we can verify the outcome via the ECS based Apache access log dashboard at https://logstash.wikimedia.org/app/dashbo" [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [14:56:20] (03CR) 10Hashar: [C: 03+1] httpd: ErrorLogFormat to strip fields with unavailable values [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [14:56:33] RECOVERY - Check systemd state on ms-be2072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:57:16] (03PS7) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [14:57:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P54927 and previous config saved to /var/cache/conftool/dbconfig/20240118-145734-marostegui.json [14:59:01] RECOVERY - Disk space on ms-be2072 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2072&var-datasource=codfw+prometheus/ops [14:59:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:34] (03CR) 10Muehlenhoff: [C: 03+2] cp: Remove obsolete Hiera entries for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991566 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:01:08] (03CR) 10Jelto: [V: 03+1 C: 03+2] prometheus::blackbox::check: make for parameter configurable [puppet] - 10https://gerrit.wikimedia.org/r/991571 (https://phabricator.wikimedia.org/T354479) (owner: 
10Jelto) [15:02:07] (03CR) 10Effie Mouzeli: [C: 03+2] Remove now obsolete Hiera host entries [puppet] - 10https://gerrit.wikimedia.org/r/991574 (owner: 10Muehlenhoff) [15:05:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: T355313', diff saved to https://phabricator.wikimedia.org/P54928 and previous config saved to /var/cache/conftool/dbconfig/20240118-150501-root.json [15:05:08] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [15:09:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) (owner: 10Effie Mouzeli) [15:10:15] (03CR) 10Hnowlan: [C: 03+2] kubernetes: make 3 eqiad jobrunners k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/991575 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:12:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P54929 and previous config saved to /var/cache/conftool/dbconfig/20240118-151241-marostegui.json [15:18:29] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1461.eqiad.wmnet with OS bullseye [15:18:35] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1469.eqiad.wmnet with OS bullseye [15:18:42] !log hnowlan@cumin1002 START - Cookbook sre.hosts.reimage for host mw1439.eqiad.wmnet with OS bullseye [15:18:42] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1461.eqiad.wmnet with OS bullseye [15:18:47] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1469.eqiad.wmnet with OS bullseye [15:18:55] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1439.eqiad.wmnet with OS bullseye [15:19:07] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetserver: move ::generators from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/991361 (https://phabricator.wikimedia.org/T333615) (owner: 10Filippo Giunchedi) [15:20:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: T355313', diff saved to https://phabricator.wikimedia.org/P54930 and previous config saved to /var/cache/conftool/dbconfig/20240118-152006-root.json [15:20:11] T355313: db2146 started being behind on replication - https://phabricator.wikimedia.org/T355313 [15:22:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor pedantic note, otherwise this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:23:59] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:25:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:27:00] that ^ could be me but all hosts have been marked inactive [15:27:16] *all reclaimed hosts [15:27:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1170:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P54931 and previous config saved to /var/cache/conftool/dbconfig/20240118-152747-marostegui.json [15:27:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:27:52] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [15:28:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1171.eqiad.wmnet with reason: Maintenance [15:28:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:28:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:28:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T354336)', diff saved to https://phabricator.wikimedia.org/P54932 and previous config saved to /var/cache/conftool/dbconfig/20240118-152832-marostegui.json [15:29:35] 10SRE-OnFire, 10Znuny, 10collaboration-services, 10Patch-For-Review: ticket.wikimedia.org should page when down - https://phabricator.wikimedia.org/T354479 (10Jelto) blackbox checks can be delayed by setting `alert_after` now: ` prometheus::blackbox::check::http { $host: team => 'collaborat... 
[15:30:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T354336)', diff saved to https://phabricator.wikimedia.org/P54933 and previous config saved to /var/cache/conftool/dbconfig/20240118-153042-marostegui.json [15:31:57] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1469.eqiad.wmnet with reason: host reimage [15:32:29] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1439.eqiad.wmnet with reason: host reimage [15:32:33] !log hnowlan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1461.eqiad.wmnet with reason: host reimage [15:32:40] (03PS13) 10Dr0ptp4kt: Varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [15:34:34] (03PS14) 10Dr0ptp4kt: Varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [15:35:04] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1469.eqiad.wmnet with reason: host reimage [15:37:32] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1439.eqiad.wmnet with reason: host reimage [15:38:28] (03PS15) 10Dr0ptp4kt: varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [15:40:20] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1461.eqiad.wmnet with reason: host reimage [15:45:04] hnowlan: I have to stop puppet on k8s nodes for a bit, I don't think it will run on your nodes due to host key, and it shouldn't change anything on them. 
It's just a heads up in case it makes the cookbook do something funky [15:45:18] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [15:45:32] PROBLEM - Check systemd state on kubernetes1018 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:36] !log stopping puppet on P:kubernetes::node to deploy 980927 - T352893 [15:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:41] T352893: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 [15:45:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P54935 and previous config saved to /var/cache/conftool/dbconfig/20240118-154549-marostegui.json [15:46:22] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [15:46:28] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [15:47:13] claime: ack, nbd. 
they've all at least started their first run [15:47:19] ack [15:47:31] (03CR) 10Clément Goubert: [C: 03+2] k8s topology labels: add row to rack transition [puppet] - 10https://gerrit.wikimedia.org/r/980927 (https://phabricator.wikimedia.org/T352893) (owner: 10Ayounsi) [15:47:56] RECOVERY - Check systemd state on kubernetes1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:11] !log Running puppet on kubestage2002 - T352893 [15:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [15:50:38] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2089.codfw.wmnet with OS bullseye [15:52:07] !log stopping puppet on P:kubernetes::node to deploy 980927 - T352883 [15:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:11] T352883: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 [15:52:12] !log Running puppet on kubestage2002 - T352883 [15:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:18] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [15:52:20] (fixing task id...) 
[15:52:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [15:53:17] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) `lang=bash cgoubert@kubestage2002:~$ sudo calicoctl node status Calico process is running. IPv4 BGP status +---... [15:53:19] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1469.eqiad.wmnet with OS bullseye [15:53:28] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1469.eqiad.wmnet with OS bullseye completed: - mw1469 (**PASS**) - Downtimed on Icinga/Alertma... [15:54:06] (03CR) 10Ssingh: "Some small questions and comments in-line. Also +1 to moving some specific bits to the bird module instead; let's do that in separate comm" [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [15:54:10] !log Running puppet on A:wikikube-staging-worker - T352883 [15:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:45] (03PS8) 10Effie Mouzeli: cache.mcrouter: upgrade to 1.3.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/991372 (https://phabricator.wikimedia.org/T355237) [15:56:05] (03Abandoned) 10Dzahn: contint: use the same PHP packages on contint before and after distro upgrade [puppet] - 10https://gerrit.wikimedia.org/r/987458 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [15:56:37] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1439.eqiad.wmnet with OS bullseye [15:56:46] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1439.eqiad.wmnet with OS bullseye completed: - mw1439 (**PASS**) - Downtimed on Icinga/Alertma... [15:57:47] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2090.codfw.wmnet with OS bullseye [15:59:35] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1461.eqiad.wmnet with OS bullseye [15:59:44] 10SRE, 10MW-on-K8s, 10serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1461.eqiad.wmnet with OS bullseye completed: - mw1461 (**PASS**) - Downtimed on Icinga/Alertma... [16:00:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P54936 and previous config saved to /var/cache/conftool/dbconfig/20240118-160055-marostegui.json [16:02:49] !log disabling PyBal and puppet on lvs2011, moving traffic to lvs2014 ahead of network change T352912 [16:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:55] T352912: Move lvs2011 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352912 [16:03:34] !log Running puppet on 'P{P:kubernetes::node} and P{F:lldp.parent ~ lsw}' - T352883 [16:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:42] T352883: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 [16:04:12] (03CR) 10Vgutierrez: [C: 03+2] varnish: enrich X-Analytics for browser prefetch / prerender / preview [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [16:04:17] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2091.codfw.wmnet with OS bullseye [16:04:24] !log 
cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: moving lvs2011 network link T352912 [16:04:26] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:04:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: moving lvs2011 network link T352912 [16:05:58] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-codfw,cr[1-2]-codfw IPv6,re0.cr1-codfw.mgmt,re0.cr2-codfw.mgmt cr1-codfw with reason: moving lvs2011 network link T352912 [16:05:59] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cr2-codfw,cr[1-2]-codfw IPv6,re0.cr1-codfw.mgmt,re0.cr2-codfw.mgmt cr1-codfw with reason: moving lvs2011 network link T352912 [16:06:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: moving lvs2011 network link T352912 [16:06:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: moving lvs2011 network link T352912 [16:06:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2092.codfw.wmnet with OS bullseye [16:08:49] (03PS1) 10Btullis: Add a superset_staging database to an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/991615 (https://phabricator.wikimedia.org/T335356) [16:09:08] 10SRE, 10ops-codfw: cr2-codfw:FPC0 failure - https://phabricator.wikimedia.org/T354732 (10Papaul) 05Open→03Resolved a:03Papaul linecard removed from cr2 and deleted from netbox [16:09:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2093.codfw.wmnet with OS bullseye [16:11:42] (03CR) 10Btullis: [C: 03+2] Add a superset_staging database to an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/991615 (https://phabricator.wikimedia.org/T335356) 
(owner: 10Btullis) [16:12:28] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [16:14:56] (03CR) 10Ssingh: Add new codfw per-rack vlans to lvs2011 and move row B vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:15:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2095.codfw.wmnet with OS bullseye [16:15:59] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [16:16:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T354336)', diff saved to https://phabricator.wikimedia.org/P54937 and previous config saved to /var/cache/conftool/dbconfig/20240118-161602-marostegui.json [16:16:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:16:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1191.eqiad.wmnet with reason: Maintenance [16:16:20] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [16:16:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T354336)', diff saved to https://phabricator.wikimedia.org/P54938 and previous config saved to /var/cache/conftool/dbconfig/20240118-161624-marostegui.json [16:17:47] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) No-op on these nodes, proceeding with the rest. 
[16:18:27] !log Running puppet on 'P{P:kubernetes::node} and not P{F:lldp.parent ~ lsw}' - T352883 [16:18:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2096.codfw.wmnet with OS bullseye [16:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:36] (03PS4) 10Peter Fischer: Search update pipeline: update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197) [16:18:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T354336)', diff saved to https://phabricator.wikimedia.org/P54939 and previous config saved to /var/cache/conftool/dbconfig/20240118-161834-marostegui.json [16:18:36] T352883: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 [16:18:52] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [16:19:13] (03CR) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:19:15] (03CR) 10Ssingh: Add new codfw per-rack vlans to lvs2011 and move row B vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:19:58] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) >>! In T352883#9469622, @Clement_Goubert wrote: > `lang=bash > IPv6 BGP status > +-------------------+----------... [16:22:00] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2097.codfw.wmnet with OS bullseye [16:22:59] (03PS8) 10Ssingh: wikifunctions: Add Google's TXT Verification entry for wikifunctions.org. 
[dns] - 10https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) (owner: 10SCherukuwada) [16:24:24] (03CR) 10Ssingh: [C: 03+2] wikifunctions: Add Google's TXT Verification entry for wikifunctions.org. [dns] - 10https://gerrit.wikimedia.org/r/991527 (https://phabricator.wikimedia.org/T355308) (owner: 10SCherukuwada) [16:25:04] !log running authdns-update for T355308 [16:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:09] T355308: WikiFunctions: Domain Verification for Google Search Console - https://phabricator.wikimedia.org/T355308 [16:25:28] (03CR) 10Peter Fischer: "Added more (detailed) hints to README" [deployment-charts] - 10https://gerrit.wikimedia.org/r/987494 (https://phabricator.wikimedia.org/T354197) (owner: 10Peter Fischer) [16:25:48] (03PS3) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) [16:27:29] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2098.codfw.wmnet with OS bullseye [16:27:35] (03CR) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:29:29] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Papaul) [16:30:03] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Papaul) [16:30:43] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Marostegui) @Papaul all the puppet changes are in place :-) [16:31:16] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Papaul) [16:32:15] 
10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269 (10Marostegui) [16:32:53] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Papaul) @Marostegui thank you [16:32:59] (03PS4) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) [16:33:01] !log hashar@deploy2002 Started deploy [integration/docroot@1d9323f]: Remove Wikimedia Design Style Guide from the list - T347895 [16:33:05] T347895: Redirect DSG to Codex's Style Guide - https://phabricator.wikimedia.org/T347895 [16:33:08] !log hashar@deploy2002 Finished deploy [integration/docroot@1d9323f]: Remove Wikimedia Design Style Guide from the list - T347895 (duration: 00m 07s) [16:33:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P54940 and previous config saved to /var/cache/conftool/dbconfig/20240118-163342-marostegui.json [16:33:49] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (10Papaul) [16:35:10] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2099.codfw.wmnet with OS bullseye [16:36:26] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2090.codfw.wmnet with OS bullseye [16:38:49] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T355345 (10phaultfinder) [16:40:08] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2011 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352912 (10cmooney) [16:40:30] (03CR) 10Ssingh: [C: 03+1] Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - 
10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:41:28] (03CR) 10Cathal Mooney: [C: 03+2] Add new codfw per-rack vlans to lvs2011 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980940 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [16:41:31] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10serviceops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (10Clement_Goubert) No-op on the rest of the infra. [16:42:23] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2100.codfw.wmnet with OS bullseye [16:43:02] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (10Clement_Goubert) Summary of deployment from {T352883}: - No-op on all nodes except kubestage200... 
[16:43:11] (03Abandoned) 10Aaron Schulz: Simplify comments and stubs for etcd-defined DB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752212 (owner: 10Aaron Schulz) [16:48:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P54941 and previous config saved to /var/cache/conftool/dbconfig/20240118-164848-marostegui.json [16:49:11] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2101.codfw.wmnet with OS bullseye [16:54:02] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2102.codfw.wmnet with OS bullseye [16:54:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:00:05] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:00:40] (03CR) 10Effie Mouzeli: [C: 04-1] "We need to rethink the approach since it might confuse people who are only learning the ropes of k8s as well as our modules here and sexta" [deployment-charts] - 10https://gerrit.wikimedia.org/r/991369 (owner: 10Alexandros Kosiaris) [17:03:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T354336)', diff saved to https://phabricator.wikimedia.org/P54942 and previous config saved to /var/cache/conftool/dbconfig/20240118-170355-marostegui.json [17:03:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:04:00] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:04:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:04:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T354336)', diff saved to https://phabricator.wikimedia.org/P54943 and previous config saved to /var/cache/conftool/dbconfig/20240118-170417-marostegui.json [17:06:14] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye [17:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T354336)', diff saved to https://phabricator.wikimedia.org/P54944 and previous config saved to /var/cache/conftool/dbconfig/20240118-170627-marostegui.json [17:11:02] 10ops-codfw, 10DBA, 10DC-Ops: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10RobH) [17:11:19] 10ops-codfw, 10DBA, 10DC-Ops: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10RobH) [17:11:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2089.codfw.wmnet with OS bullseye [17:11:42] !log bking@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2102.codfw.wmnet with reason: host reimage [17:13:30] 10ops-codfw, 10DBA, 10DC-Ops: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10RobH) [17:14:21] (03PS1) 10Cathal Mooney: Move codfw row-b sub-interfaces to primary uplink lvs2011 [puppet] - 10https://gerrit.wikimedia.org/r/991618 (https://phabricator.wikimedia.org/T352912) [17:14:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2102.codfw.wmnet with reason: host reimage [17:16:53] (03CR) 10Ssingh: [C: 03+1] Move codfw row-b sub-interfaces to primary uplink lvs2011 [puppet] - 10https://gerrit.wikimedia.org/r/991618 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [17:18:17] (03PS1) 10Cathal Mooney: Skip switch interface if no untagged_vlan when finding bgp peers [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/991619 (https://phabricator.wikimedia.org/T355225) [17:19:16] (03PS1) 10Dreamy Jazz: Log to statsd HTTP status codes and reduce logstash log levels [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991555 (https://phabricator.wikimedia.org/T355216) [17:19:24] (03CR) 10Cathal Mooney: [C: 03+2] Move codfw row-b sub-interfaces to primary uplink lvs2011 [puppet] - 10https://gerrit.wikimedia.org/r/991618 (https://phabricator.wikimedia.org/T352912) (owner: 10Cathal Mooney) [17:20:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2098.codfw.wmnet with OS bullseye [17:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P54945 and previous config saved to /var/cache/conftool/dbconfig/20240118-172134-marostegui.json [17:23:24] (03PS24) 10Brouberol: global_config: list IPs of hadoop master/workers and kerberos nodes [puppet] - 10https://gerrit.wikimedia.org/r/987393 
(https://phabricator.wikimedia.org/T331894) [17:23:31] (03CR) 10Brouberol: "Thanks Αλέξανδρος for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/987393 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [17:24:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:25:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2091.codfw.wmnet with OS bullseye [17:25:33] (03CR) 10Jforrester: [C: 03+1] "LGTM. Do you want help getting this deployed or are you happy to schedule it?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) (owner: 10Msz2001) [17:27:35] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2092.codfw.wmnet with OS bullseye [17:28:40] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2099.codfw.wmnet with OS bullseye [17:28:41] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10kostajh) [17:29:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.14% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:30:45] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2093.codfw.wmnet with OS bullseye [17:30:49] !log Re-enabling PyBal on lvs2011 after network 
migration T352912 [17:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:53] T352912: Move lvs2011 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352912 [17:31:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2102.codfw.wmnet with OS bullseye [17:33:20] (03CR) 10Dr0ptp4kt: varnish: enrich X-Analytics for browser prefetch / prerender / preview (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [17:33:24] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye [17:33:44] (03CR) 10Dr0ptp4kt: varnish: enrich X-Analytics for browser prefetch / prerender / preview (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/981352 (https://phabricator.wikimedia.org/T346463) (owner: 10Ottomata) [17:33:57] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:34:19] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:36:04] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2100.codfw.wmnet with OS bullseye [17:36:30] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2095.codfw.wmnet with OS bullseye [17:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to 
https://phabricator.wikimedia.org/P54946 and previous config saved to /var/cache/conftool/dbconfig/20240118-173640-marostegui.json [17:37:20] (03CR) 10Msz2001: Promote wikimaniawiki to Vector 2022 as default skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) (owner: 10Msz2001) [17:39:19] (JobUnavailable) resolved: Reduced availability for job pybal in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:39:26] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2096.codfw.wmnet with OS bullseye [17:42:41] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2101.codfw.wmnet with OS bullseye [17:43:06] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2097.codfw.wmnet with OS bullseye [17:45:26] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:48:57] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T354336)', diff saved to https://phabricator.wikimedia.org/P54947 and previous config saved to /var/cache/conftool/dbconfig/20240118-175147-marostegui.json [17:51:49] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:51:53] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [17:52:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1202.eqiad.wmnet with reason: Maintenance [17:52:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T354336)', diff saved to https://phabricator.wikimedia.org/P54948 and previous config saved to /var/cache/conftool/dbconfig/20240118-175209-marostegui.json [17:53:43] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) [17:53:52] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2011 primary uplink and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352912 (10cmooney) 05Open→03Resolved All done! 
[17:54:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T354336)', diff saved to https://phabricator.wikimedia.org/P54949 and previous config saved to /var/cache/conftool/dbconfig/20240118-175420-marostegui.json [17:54:41] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [17:54:47] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) [17:54:55] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2012 from private1-b-codfw (row) to private1-b2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352918 (10cmooney) [17:55:07] 10SRE, 10Infrastructure-Foundations, 10netops: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets - https://phabricator.wikimedia.org/T354869 (10cmooney) [17:55:15] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) [17:55:23] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2011 from private1-a-codfw (row) to private1-a2-codfw (rack) vlan - https://phabricator.wikimedia.org/T352920 (10cmooney) [17:55:40] 10SRE, 10Infrastructure-Foundations, 10netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (10cmooney) [17:55:50] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) 05Open→03Resolved [17:56:01] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Migrate lvs2011 and lvs2012 to new top-of-rack switches - https://phabricator.wikimedia.org/T348178 (10cmooney) 05Open→03Resolved 
[17:56:09] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Codfw row A-B migration - non-standard device moves - https://phabricator.wikimedia.org/T348128 (10cmooney) [18:00:05] bd808: Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1800). Please do the needful. [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1800) [18:04:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P54950 and previous config saved to /var/cache/conftool/dbconfig/20240118-180456-ladsgroup.json [18:05:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:05:48] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Tgr) >>! In T355243#9468091, @kostajh wrote: > what's the correct way to stop a script that another user has run? Someone with root can kill it (send... [18:09:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P54951 and previous config saved to /var/cache/conftool/dbconfig/20240118-180927-marostegui.json [18:10:36] (03CR) 10Gergő Tisza: "This more or less turns the transform() call into a no-op (all it does is calculate the thumbnail URL). 
Which I suppose is fine as a short" [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991551 (https://phabricator.wikimedia.org/T355309) (owner: 10Dreamy Jazz) [18:20:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P54953 and previous config saved to /var/cache/conftool/dbconfig/20240118-182003-ladsgroup.json [18:24:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P54954 and previous config saved to /var/cache/conftool/dbconfig/20240118-182433-marostegui.json [18:25:40] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [18:28:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2089.codfw.wmnet with OS bullseye [18:34:44] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2091.codfw.wmnet with OS bullseye [18:35:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P54955 and previous config saved to /var/cache/conftool/dbconfig/20240118-183510-ladsgroup.json [18:39:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T354336)', diff saved to https://phabricator.wikimedia.org/P54956 and previous config saved to /var/cache/conftool/dbconfig/20240118-183940-marostegui.json [18:39:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:39:45] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [18:39:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance [18:40:02] !log marostegui@cumin1002 dbctl commit (dc=all): 
'Depooling db1227 (T354336)', diff saved to https://phabricator.wikimedia.org/P54957 and previous config saved to /var/cache/conftool/dbconfig/20240118-184002-marostegui.json [18:40:09] (03CR) 10Jforrester: [C: 03+1] Promote wikimaniawiki to Vector 2022 as default skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) (owner: 10Msz2001) [18:41:48] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Tgr) >>! In T355243#9469332, @Dreamy_Jazz wrote: > it seems you cannot call `File::transform` with the `RENDER_NOW` flag while using a job. I don't th... [18:42:35] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2092.codfw.wmnet with OS bullseye [18:45:00] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2089.codfw.wmnet with reason: host reimage [18:46:11] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Tgr) >>! In T355243#9470344, @Tgr wrote: > The root issue is that RENDER_NOW breaks Thumbor integration. The same probably happens if you make a reques... 
[18:47:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2093.codfw.wmnet with OS bullseye [18:48:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2089.codfw.wmnet with reason: host reimage [18:50:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P54958 and previous config saved to /var/cache/conftool/dbconfig/20240118-185016-ladsgroup.json [18:50:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:50:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:50:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:50:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1238 (T352010)', diff saved to https://phabricator.wikimedia.org/P54959 and previous config saved to /var/cache/conftool/dbconfig/20240118-185038-ladsgroup.json [18:51:23] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2091.codfw.wmnet with reason: host reimage [18:54:32] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2091.codfw.wmnet with reason: host reimage [18:59:14] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2092.codfw.wmnet with reason: host reimage [18:59:50] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Dreamy_Jazz) >>! In T355243#9470344, @Tgr wrote: >>>! In T355243#9469332, @Dreamy_Jazz wrote: >> it seems you cannot call `File::transform` with the `R... [19:00:05] jnuche and jeena: #bothumor I � Unicode. 
All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T1900). [19:02:24] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2092.codfw.wmnet with reason: host reimage [19:04:12] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage [19:06:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2089.codfw.wmnet with OS bullseye [19:07:29] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2093.codfw.wmnet with reason: host reimage [19:11:28] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2091.codfw.wmnet with OS bullseye [19:14:58] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (10Tgr) >>! In T355243#9470364, @Tgr wrote: > Although when I try this, there are a bunch of `Thumbor-*` headers on the response so it doesn't seem like i... 
[19:19:23] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353 (10RobH) [19:19:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2092.codfw.wmnet with OS bullseye [19:20:29] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353 (10RobH) [19:21:54] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355 (10RobH) [19:22:16] 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355 (10RobH) [19:23:07] 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353 (10RobH) [19:23:50] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [19:23:59] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:24:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2093.codfw.wmnet with OS bullseye [19:26:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2095.codfw.wmnet with OS bullseye [19:31:10] (03PS17) 10Gehel: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:33:10] (03PS1) 10BCornwall: dns: Don't disable puppet/bird on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) [19:35:52] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 
10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:37:20] (03CR) 10CI reject: [V: 04-1] dns: Don't disable puppet/bird on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: 10BCornwall) [19:40:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T354336)', diff saved to https://phabricator.wikimedia.org/P54960 and previous config saved to /var/cache/conftool/dbconfig/20240118-194024-marostegui.json [19:40:30] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [19:43:02] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage [19:43:14] (03CR) 10Gehel: wdqs.data_transfer: refactor spicerack class api (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:46:21] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2095.codfw.wmnet with reason: host reimage [19:47:03] (03CR) 10Ryan Kemper: [C: 03+2] wdqs graph-split: add experimental svcs [dns] - 10https://gerrit.wikimedia.org/r/991429 (https://phabricator.wikimedia.org/T354662) (owner: 10Ryan Kemper) [19:47:09] (03PS2) 10Ryan Kemper: wdqs graph-split: add experimental svcs [dns] - 10https://gerrit.wikimedia.org/r/991429 (https://phabricator.wikimedia.org/T354662) [19:47:25] (03CR) 10Ryan Kemper: [V: 03+2] wdqs graph-split: add experimental svcs [dns] - 10https://gerrit.wikimedia.org/r/991429 (https://phabricator.wikimedia.org/T354662) (owner: 10Ryan Kemper) [19:47:33] (03PS18) 10Gehel: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:48:56] !log T354662 Running `sudo -i 
authdns-update` on `dns1004` following merge of https://gerrit.wikimedia.org/r/c/operations/dns/+/991429 [19:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:00] T354662: Create DNS records for 3 new WDQS endpoints - https://phabricator.wikimedia.org/T354662 [19:52:01] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [19:55:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P54961 and previous config saved to /var/cache/conftool/dbconfig/20240118-195531-marostegui.json [19:58:22] (03CR) 10Dreamy Jazz: Remove RENDER_NOW from File::transform call to avoid job thumbnailing (031 comment) [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991551 (https://phabricator.wikimedia.org/T355309) (owner: 10Dreamy Jazz) [19:58:33] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/991439/1160/" [puppet] - 10https://gerrit.wikimedia.org/r/991439 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [20:03:41] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2095.codfw.wmnet with OS bullseye [20:04:17] (03PS1) 10Vgutierrez: profile::lvs: Start ipip-multiqueue-optimizer on system boot [puppet] - 10https://gerrit.wikimedia.org/r/991641 (https://phabricator.wikimedia.org/T355359) [20:06:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/991641 (https://phabricator.wikimedia.org/T355359) (owner: 10Vgutierrez) [20:09:28] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/991641 (https://phabricator.wikimedia.org/T355359) 
(owner: 10Vgutierrez) [20:10:01] (03PS1) 10Dzahn: phabricator: fix source host for repo sync test [puppet] - 10https://gerrit.wikimedia.org/r/991642 (https://phabricator.wikimedia.org/T334519) [20:10:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P54962 and previous config saved to /var/cache/conftool/dbconfig/20240118-201037-marostegui.json [20:11:12] (03CR) 10Dzahn: [C: 03+2] phabricator: fix source host for repo sync test [puppet] - 10https://gerrit.wikimedia.org/r/991642 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [20:13:32] (03CR) 10Dzahn: [V: 03+2 C: 03+2] phabricator: fix source host for repo sync test [puppet] - 10https://gerrit.wikimedia.org/r/991642 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn) [20:14:07] (03PS1) 10Vgutierrez: profile::realserver::ipip: Start tcp-mss-clamper on system boot [puppet] - 10https://gerrit.wikimedia.org/r/991644 (https://phabricator.wikimedia.org/T355359) [20:14:46] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] profile::lvs: Start ipip-multiqueue-optimizer on system boot [puppet] - 10https://gerrit.wikimedia.org/r/991641 (https://phabricator.wikimedia.org/T355359) (owner: 10Vgutierrez) [20:15:27] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1162/co" [puppet] - 10https://gerrit.wikimedia.org/r/991644 (https://phabricator.wikimedia.org/T355359) (owner: 10Vgutierrez) [20:19:12] (03CR) 10Ssingh: [C: 03+1] profile::realserver::ipip: Start tcp-mss-clamper on system boot [puppet] - 10https://gerrit.wikimedia.org/r/991644 (https://phabricator.wikimedia.org/T355359) (owner: 10Vgutierrez) [20:20:16] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] profile::realserver::ipip: Start tcp-mss-clamper on system boot [puppet] - 10https://gerrit.wikimedia.org/r/991644 (https://phabricator.wikimedia.org/T355359) (owner: 
10Vgutierrez) [20:23:41] (03PS19) 10Gehel: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [20:24:34] !log rsyncing phab repo data, gitlab2003 pulls from phab2002 (inactive server) - test only to see how long it will take, can be stopped [20:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T354336)', diff saved to https://phabricator.wikimedia.org/P54963 and previous config saved to /var/cache/conftool/dbconfig/20240118-202544-marostegui.json [20:25:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance [20:25:49] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [20:26:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1236.eqiad.wmnet with reason: Maintenance [20:26:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T354336)', diff saved to https://phabricator.wikimedia.org/P54964 and previous config saved to /var/cache/conftool/dbconfig/20240118-202606-marostegui.json [20:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T354336)', diff saved to https://phabricator.wikimedia.org/P54965 and previous config saved to /var/cache/conftool/dbconfig/20240118-202817-marostegui.json [20:29:05] (03PS1) 10Ssingh: ipip-multiqueue-optimizer/tcp-mss-clamper: update systemd units [puppet] - 10https://gerrit.wikimedia.org/r/991648 (https://phabricator.wikimedia.org/T355359) [20:30:15] (03PS2) 10Dzahn: switch phabricator server to codfw [dns] - 10https://gerrit.wikimedia.org/r/989535 (https://phabricator.wikimedia.org/T334519) [20:30:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1163/co" [puppet] - 10https://gerrit.wikimedia.org/r/991648 (https://phabricator.wikimedia.org/T355359) (owner: 10Ssingh) [20:31:13] (03CR) 10BCornwall: [C: 03+1] "Nit: Usually [Install] is placed on the bottom. Super small nit" [puppet] - 10https://gerrit.wikimedia.org/r/991648 (https://phabricator.wikimedia.org/T355359) (owner: 10Ssingh) [20:33:50] (03PS1) 10Dzahn: phabricator: switch active server from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/991649 (https://phabricator.wikimedia.org/T334519) [20:34:41] (03PS2) 10Dzahn: phabricator: switch active server from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/991649 (https://phabricator.wikimedia.org/T334519) [20:38:44] (03PS1) 10Dzahn: dumps: replace hardcoded phab server name with a lookup [puppet] - 10https://gerrit.wikimedia.org/r/991651 (https://phabricator.wikimedia.org/T354221) [20:40:35] (03PS12) 10EoghanGaffney: [gerrit] Add rsync job for lfs sync [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) [20:41:08] (03PS1) 10Majavah: replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) [20:41:11] (03PS1) 10Majavah: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) [20:41:13] (03PS1) 10Majavah: maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) [20:42:40] (03CR) 10BCornwall: [C: 03+2] ipip-multiqueue-optimizer/tcp-mss-clamper: update systemd units [puppet] - 10https://gerrit.wikimedia.org/r/991648 (https://phabricator.wikimedia.org/T355359) (owner: 10Ssingh) [20:42:57] (03CR) 10EoghanGaffney: [V: 03+1] [gerrit] Add rsync job for lfs sync (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [20:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P54966 and previous config saved to /var/cache/conftool/dbconfig/20240118-204324-marostegui.json [20:44:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye [20:46:37] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [20:46:54] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [20:47:01] (03PS2) 10Majavah: replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) [20:47:03] (03PS2) 10Majavah: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) [20:47:05] (03PS2) 10Majavah: maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) [20:47:07] (03CR) 10CI reject: [V: 04-1] maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [20:52:29] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [20:52:46] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 
(https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [20:57:50] (03PS3) 10Majavah: replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) [20:57:52] (03PS3) 10Majavah: replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) [20:57:54] (03PS3) 10Majavah: maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) [20:58:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P54967 and previous config saved to /var/cache/conftool/dbconfig/20240118-205830-marostegui.json [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240118T2100) [21:00:05] Dreamy_Jazz and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] \o [21:00:22] I can deploy my patch [21:01:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991555 (https://phabricator.wikimedia.org/T355216) (owner: 10Dreamy Jazz) [21:02:15] Oh, hey. [21:02:25] Hi. [21:02:37] Dreamy_Jazz: I'm happy to deploy mine once you're done. [21:02:41] (No rush!) [21:02:50] Sure. I'll ping you when my one is done. 
[21:02:54] Thanks [21:03:45] (03CR) 10CI reject: [V: 04-1] replica_cnf_api: Do not check for file existence [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [21:05:02] (03Merged) 10jenkins-bot: Log to statsd HTTP status codes and reduce logstash log levels [extensions/MediaModeration] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991555 (https://phabricator.wikimedia.org/T355216) (owner: 10Dreamy Jazz) [21:05:17] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:991555|Log to statsd HTTP status codes and reduce logstash log levels (T355216)]] [21:05:23] T355216: Keep a track of HTTP status codes from PhotoDNA - https://phabricator.wikimedia.org/T355216 [21:08:24] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:991555|Log to statsd HTTP status codes and reduce logstash log levels (T355216)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:27] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [21:10:44] This is probably the wrong place to ask, but where should I report a problem with Phabricator search? [21:11:27] what kind of a problem? [21:11:43] "All of the configured Fulltext Search services failed. - AphrontQueryTimeoutQueryException: Query timed out after 30 second(s)!" [21:12:17] What are you searching for? [21:12:21] hmm. is that happening on a specific query? [21:12:26] phabricator search works fine on my end [21:12:44] hmm. it's tried both things I searched, lemme try something else [21:13:15] (03PS1) 10Andrew Bogott: Trove: specify proper package version for mariadb and postgres backups [puppet] - 10https://gerrit.wikimedia.org/r/991661 (https://phabricator.wikimedia.org/T349651) [21:13:23] oh interesting. 
I was searching "wikimedia.de" and I also tried a couple other random things, but "pybal" works fine [21:13:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T354336)', diff saved to https://phabricator.wikimedia.org/P54968 and previous config saved to /var/cache/conftool/dbconfig/20240118-211337-marostegui.json [21:13:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [21:13:42] T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336 [21:13:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [21:13:55] Some common terms make phab timeout Jeff_Green [21:14:14] !log Stopped MediaModeration scanning script (T351400) [21:14:17] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:991555|Log to statsd HTTP status codes and reduce logstash log levels (T355216)]] (duration: 09m 00s) [21:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:19] T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [21:14:26] T355216: Keep a track of HTTP status codes from PhotoDNA - https://phabricator.wikimedia.org/T355216 [21:14:37] RhinosF1: ok, thx [21:15:03] !log T351400 running on a tmux session `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` [21:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:23] Jeff_Green: to give you an answer to your question: whenever possible, such issues would be reportable as #phabricator in Phabricator itself. 
[21:15:34] Jeff_Green: https://phabricator.wikimedia.org/T258803 [21:15:46] such as the task RhinosF1 linked :) [21:15:48] urbanecm: found the task :) [21:15:51] urbanecm: ok makes sense [21:15:54] James_F: Done with my change. [21:16:08] James_F: mind pinging me once you're done please? [21:23:41] James_F: Did you get my ping? [21:28:04] (03PS4) 10Majavah: maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) [21:29:01] 10SRE, 10SRE-Access-Requests, 10User-ItamarWMDE: Requesting access to for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10thcipriani) >>! In T354049#9468068, @ItamarWMDE wrote: > @thcipriani @MoritzMuehlenhoff @DZahn, In the same way I and @HasanAkgun_WMDE needed `restricted` access... [21:30:58] James_F: I'm going to deploy the second change in the window. [21:31:04] Sure. [21:31:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:31:30] Oh. I thought you were AFK. Happy for you to deploy if you prefer. [21:31:49] Either is fine. [21:31:52] I'll do it. :-) [21:31:55] Sure. 
[21:32:05] :) [21:32:09] RECOVERY - Host mw2394 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [21:32:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51306 bytes in 0.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:32:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) (owner: 10Msz2001) [21:32:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.298 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:03] (03Merged) 10jenkins-bot: Promote wikimaniawiki to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991547 (https://phabricator.wikimedia.org/T355297) (owner: 10Msz2001) [21:34:15] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:991547|Promote wikimaniawiki to Vector 2022 as default skin (T355297)]] [21:34:19] T355297: Enable Vector 2022 as default skin on Wikimania wiki - https://phabricator.wikimedia.org/T355297 [21:35:34] !log jforrester@deploy2002 jforrester and msz2001: Backport for [[gerrit:991547|Promote wikimaniawiki to Vector 2022 as default skin (T355297)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:35:55] !log jforrester@deploy2002 jforrester and msz2001: Continuing with sync [21:37:01] urbanecm: Final scap running (canaries done); over to you once it's done. [21:40:33] ack [21:41:48] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:991547|Promote wikimaniawiki to Vector 2022 as default skin (T355297)]] (duration: 07m 33s) [21:41:53] T355297: Enable Vector 2022 as default skin on Wikimania wiki - https://phabricator.wikimedia.org/T355297 [21:42:02] Done. 
[21:43:22] (03PS1) 10Urbanecm: Use BetaFeatures::isFeatureEnabled instead of getOption [extensions/Flow] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991561 (https://phabricator.wikimedia.org/T354288) [21:43:27] okay [21:43:28] (03CR) 10Urbanecm: [C: 03+2] Use BetaFeatures::isFeatureEnabled instead of getOption [extensions/Flow] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991561 (https://phabricator.wikimedia.org/T354288) (owner: 10Urbanecm) [21:49:32] (03Merged) 10jenkins-bot: Use BetaFeatures::isFeatureEnabled instead of getOption [extensions/Flow] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991561 (https://phabricator.wikimedia.org/T354288) (owner: 10Urbanecm) [21:50:04] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:991561|Use BetaFeatures::isFeatureEnabled instead of getOption (T354288)]] [21:50:10] T354288: TypeError: Return value of Flow\Hooks::isBetaFeatureEnabledInTalkPage() must be of the type bool, null returned - https://phabricator.wikimedia.org/T354288 [21:51:36] (03PS2) 10BCornwall: dns: Don't disable puppet/bird on restarting wdns [cookbooks] - 10https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) [21:57:03] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:991561|Use BetaFeatures::isFeatureEnabled instead of getOption (T354288)]] (duration: 06m 58s) [21:57:08] T354288: TypeError: Return value of Flow\Hooks::isBetaFeatureEnabledInTalkPage() must be of the type bool, null returned - https://phabricator.wikimedia.org/T354288 [21:59:48] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye [22:00:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [22:27:56] (03PS1) 10Bking: elastic: Add Puppet 7 hieradata for elastic2086 [puppet] - 10https://gerrit.wikimedia.org/r/991674 (https://phabricator.wikimedia.org/T354959) [22:29:10] (03CR) 
10Ryan Kemper: [C: 03+1] elastic: Add Puppet 7 hieradata for elastic2086 [puppet] - 10https://gerrit.wikimedia.org/r/991674 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking) [22:30:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye [22:40:05] (03CR) 10Bking: [C: 03+2] elastic: Add Puppet 7 hieradata for elastic2086 [puppet] - 10https://gerrit.wikimedia.org/r/991674 (https://phabricator.wikimedia.org/T354959) (owner: 10Bking) [22:42:57] (03PS2) 10Andrew Bogott: Trove: specify proper package version for db backups [puppet] - 10https://gerrit.wikimedia.org/r/991661 (https://phabricator.wikimedia.org/T349651) [22:47:17] (03CR) 10Andrew Bogott: [C: 03+2] Trove: specify proper package version for db backups [puppet] - 10https://gerrit.wikimedia.org/r/991661 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [22:54:05] !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host elastic2086* [22:55:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host elastic2086* [22:55:47] !log bking@cumin2002 START - Cookbook sre.puppet.migrate-host for host elastic2086.codfw.wmnet [22:55:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:56:32] (03PS1) 10Tim Starling: fix heading style conflict with CM5 [extensions/CodeMirror] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991562 (https://phabricator.wikimedia.org/T355290) [22:56:33] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-wikifunctions_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:57:44] (03CR) 10Tim Starling: [C: 03+2] fix heading style conflict with CM5 
[extensions/CodeMirror] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991562 (https://phabricator.wikimedia.org/T355290) (owner: 10Tim Starling) [22:59:16] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host elastic2086.codfw.wmnet [23:02:05] (03Merged) 10jenkins-bot: fix heading style conflict with CM5 [extensions/CodeMirror] (wmf/1.42.0-wmf.14) - 10https://gerrit.wikimedia.org/r/991562 (https://phabricator.wikimedia.org/T355290) (owner: 10Tim Starling) [23:07:22] (03CR) 10Dzahn: "Hi Luke, can you help me find a reviewer for this? I have been reading the team interface page and fixed some outdated links to an archive" [puppet] - 10https://gerrit.wikimedia.org/r/991651 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn) [23:13:27] !log tstarling@deploy2002 Synchronized php-1.42.0-wmf.14/extensions/CodeMirror/resources/mode/mediawiki/mediawiki.less: fix CodeMirror style bug T355290 (duration: 06m 33s) [23:13:32] T355290: Heading bolding in syntax highlighter bleeding onto all following text - https://phabricator.wikimedia.org/T355290 [23:24:00] (ProbeDown) firing: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:50] (03PS1) 10Dzahn: phabricator: repo-sync test, use a machine in other DC [puppet] - 10https://gerrit.wikimedia.org/r/991677 (https://phabricator.wikimedia.org/T334519) [23:39:16] (03CR) 10Cwhite: [C: 03+2] httpd: ErrorLogFormat to strip fields with unavailable values [puppet] - 10https://gerrit.wikimedia.org/r/967877 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [23:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T352010)', diff saved to https://phabricator.wikimedia.org/P54969 and previous config saved to 
/var/cache/conftool/dbconfig/20240118-234213-ladsgroup.json [23:42:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:47:59] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye [23:49:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2098.codfw.wmnet with OS bullseye [23:50:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye [23:53:33] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:01] (03PS1) 10Dzahn: microsites/query_service: comment out new experimental sites [puppet] - 10https://gerrit.wikimedia.org/r/991679 [23:57:03] (03PS1) 10Ryan Kemper: wdqs graph-split: disable microsite [puppet] - 10https://gerrit.wikimedia.org/r/991680 (https://phabricator.wikimedia.org/T354658) [23:57:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P54970 and previous config saved to /var/cache/conftool/dbconfig/20240118-235720-ladsgroup.json [23:57:21] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2099.codfw.wmnet with OS bullseye [23:58:18] (03CR) 10CI reject: [V: 04-1] microsites/query_service: comment out new experimental sites [puppet] - 10https://gerrit.wikimedia.org/r/991679 (owner: 10Dzahn) [23:58:24] (03Abandoned) 10Dzahn: microsites/query_service: comment out new experimental sites [puppet] - 10https://gerrit.wikimedia.org/r/991679 (owner: 10Dzahn) [23:58:59] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state