[00:01:30] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (Papaul)
[00:05:52] PROBLEM - Check systemd state on backup1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:13] SRE, Datacenter-Switchover, User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (Legoktm)
[00:06:34] SRE, Datacenter-Switchover, User-notice: September 2021 Datacenter switchover (codfw -> eqiad) - https://phabricator.wikimedia.org/T287539 (Legoktm)
[00:26:48] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[00:32:18] RECOVERY - Check systemd state on backup1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:42] !log Deployed patch for T290692
[00:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:27] (PS4) Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - https://gerrit.wikimedia.org/r/719619
[01:33:23] (CR) Jdlrobson: Unset logo config rather than set to false (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/719619 (owner: Jdlrobson)
[01:54:02] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul) default username is admin and password is YourPaSsWoRd once login and issue the command "sonic-cli", any command you run after that gives you the error below on all 4 switches ` admin@sonic:~$ sonic-cli s...
[02:51:36] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:53:32] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:32:08] PROBLEM - Host mw2280 is DOWN: PING CRITICAL - Packet loss = 100%
[04:37:49] (PS1) Effie Mouzeli: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T290485)
[04:46:56] (PS1) Marostegui: dbproxy1018: Depool clouddb1017 [puppet] - https://gerrit.wikimedia.org/r/720187 (https://phabricator.wikimedia.org/T290630)
[04:47:41] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1017 [puppet] - https://gerrit.wikimedia.org/r/720187 (https://phabricator.wikimedia.org/T290630) (owner: Marostegui)
[04:49:04] !log Depool clouddb1017:3311
[04:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:19] !log Depool clouddb1013:3311
[04:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:55:03] (PS1) Effie Mouzeli: mwdebug: bump opcache and interned string buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720188 (https://phabricator.wikimedia.org/T280497)
[04:55:30] (PS2) Effie Mouzeli: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497)
[04:57:04] (PS3) Effie Mouzeli: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497)
[04:59:29] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1017" [puppet] - https://gerrit.wikimedia.org/r/720171
[05:00:35] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1017" [puppet] - https://gerrit.wikimedia.org/r/720171 (owner: Marostegui)
[05:03:57] (PS2) Effie Mouzeli: scaffold: add auto_prepend_file option for PHP [deployment-charts] - https://gerrit.wikimedia.org/r/719973
[05:04:30] (PS1) Marostegui: dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/720189 (https://phabricator.wikimedia.org/T290630)
[05:05:05] (CR) jerkins-bot: [V: -1] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/720189 (https://phabricator.wikimedia.org/T290630) (owner: Marostegui)
[05:05:45] (PS2) Marostegui: dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/720189 (https://phabricator.wikimedia.org/T290630)
[05:06:38] (CR) Marostegui: [C: +2] dbproxy1019: Depool clouddb1013 [puppet] - https://gerrit.wikimedia.org/r/720189 (https://phabricator.wikimedia.org/T290630) (owner: Marostegui)
[05:06:58] (CR) Effie Mouzeli: [C: +2] mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:10:14] (PS1) Marostegui: Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/720172
[05:10:51] (CR) Marostegui: [C: +2] Revert "dbproxy1019: Depool clouddb1013" [puppet] - https://gerrit.wikimedia.org/r/720172 (owner: Marostegui)
[05:12:25] !log Repool clouddb1013:3311
[05:12:27] !log Repool clouddb1017:3311
[05:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:24] (PS4) Effie Mouzeli: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497)
[05:16:54] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:55] (CR) Effie Mouzeli: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:23:00] (CR) Effie Mouzeli: [C: +2] mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:27:17] (Merged) jenkins-bot: mediawiki: add interned_strings_buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720163 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:29:18] legoktm: thanks!
[05:35:31] (PS7) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[05:36:39] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31047/console" [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[05:42:08] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:18] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:43:19] (PS8) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[05:44:00] (PS2) Effie Mouzeli: mwdebug: bump opcache and interned string buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720188 (https://phabricator.wikimedia.org/T280497)
[05:45:31] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:48:51] (CR) Effie Mouzeli: [C: +2] mwdebug: bump opcache and interned string buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720188 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:49:00] (PS9) Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791)
[05:50:15] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31048/console" [puppet] - https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[05:51:06] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be1062, an-web1001, labstore1006, deploy1002, ms-be1051 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:52:55] (Merged) jenkins-bot: mwdebug: bump opcache and interned string buffer [deployment-charts] - https://gerrit.wikimedia.org/r/720188 (https://phabricator.wikimedia.org/T280497) (owner: Effie Mouzeli)
[05:54:27] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:25] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2280.codfw.wmnet
[05:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:38] !log powercycle mw2280 - no tty available in mgmt, no ssh, host frozen
[05:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:53] mmm mw2280 times out in powercycle, and hardreset doesn't work
[05:59:55] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:17] ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (elukey)
[06:02:34] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2280.codfw.wmnet
[06:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:16] ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (elukey) The host is currently with status `inactive` so we can do maintenance anytime!
[06:03:38] ACKNOWLEDGEMENT - Host mw2280 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T290708
[06:07:24] ACKNOWLEDGEMENT - Host mc1026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Effie Mouzeli Host is being decommd - The acknowledgement expires at: 2021-10-11 06:07:00.
[06:20:38] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:24:20] PROBLEM - Check systemd state on lvs4007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:25] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[06:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:52] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[06:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:55] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[06:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:04] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[06:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:15] (CR) JMeybohm: [C: +2] Suspend mmkubernetes on connection errors [debs/rsyslog] (debian/stretch-wikimedia) - https://gerrit.wikimedia.org/r/715227 (https://phabricator.wikimedia.org/T289766) (owner: JMeybohm)
[06:32:24] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:25] (CR) DCausse: [C: +1] wdqs: remove codfw hourly restarts [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper)
[06:52:09] (CR) Legoktm: wdqs: remove codfw hourly restarts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720102 (https://phabricator.wikimedia.org/T290330) (owner: Ryan Kemper)
[06:52:44] RECOVERY - Check systemd state on lvs4007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:56:41] !log disable puppet on deploy1002 and mw2254
[06:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:58] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210910T0700)
[07:19:06] !log imported rsyslog 8.1901.0-1~bpo9+wmf2 to stretch-wikimedia - T289766
[07:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:11] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766
[07:23:08] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 79, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:25:36] !log updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-staging - T289766
[07:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:41] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766
[07:26:26] (CR) Dzahn: [C: +2] Revert "planet: remove ad.huikeshoven feed" [puppet] - https://gerrit.wikimedia.org/r/719694 (owner: Dzahn)
[07:31:18] SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (Dzahn) I removed the offending feed but the issue was still here. Then I deleted ALL the existing state files for the "en" feed collection and ran updates again multiple times. The issue was gone. Finall...
[07:31:30] SRE: Planet update service flapping/failing on planet1002 - https://phabricator.wikimedia.org/T289984 (Dzahn) Open→Resolved
[07:31:43] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:31:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:56] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:44] (CR) Filippo Giunchedi: "LGTM overall! See inline for clarification re: external labels" [alerts] - https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) (owner: Cwhite)
[07:45:53] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:06] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:58] (CR) Jelto: [V: +1 C: +2] gitlab::backup remove deprication warning and deletion of config backup [puppet] - https://gerrit.wikimedia.org/r/719930 (https://phabricator.wikimedia.org/T288324) (owner: Jelto)
[07:50:01] (CR) Ayounsi: [C: +2] JSON schema, add coverage to secrets [homer/public] - https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) (owner: Ayounsi)
[07:50:28] PROBLEM - Check systemd state on an-worker1104 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:46] PROBLEM - Hadoop NodeManager on an-worker1104 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:50:54] (Merged) jenkins-bot: JSON schema, add coverage to secrets [homer/public] - https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) (owner: Ayounsi)
[07:52:03] SRE-tools, Infrastructure-Foundations, homer, Patch-For-Review: Validate (and document) Homer config files - https://phabricator.wikimedia.org/T272688 (ayounsi) Open→Resolved a:ayounsi All done!
[07:52:57] (PS1) Elukey: Add istio-system namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/720233 (https://phabricator.wikimedia.org/T288829)
[07:54:16] RECOVERY - Check systemd state on an-worker1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:34] RECOVERY - Hadoop NodeManager on an-worker1104 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:57:00] (PS3) Effie Mouzeli: scaffold: add more options for PHP [deployment-charts] - https://gerrit.wikimedia.org/r/719973
[07:57:59] !log installing ntfs-3g security updates
[07:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:46] !log updating rsyslog to 8.1901.0-1~bpo9+wmf2 on kubernetes-workers - T289766
[07:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:50] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766
[08:03:21] (CR) Filippo Giunchedi: "LGTM! See inline, and I'll be investigating too the TODO re: logstash tests" [alerts] - https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: Cwhite)
[08:10:04] (CR) Elukey: [C: +2] Add istio-system namespace config for ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/720233 (https://phabricator.wikimedia.org/T288829) (owner: Elukey)
[08:12:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:47] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:08] sorry for the spam, last one worked :D
[08:16:09] (PS1) Dzahn: planet: replace http_proxy with https_proxy and add it to the update command [puppet] - https://gerrit.wikimedia.org/r/720234 (https://phabricator.wikimedia.org/T285251)
[08:16:39] (CR) jerkins-bot: [V: -1] planet: replace http_proxy with https_proxy and add it to the update command [puppet] - https://gerrit.wikimedia.org/r/720234 (https://phabricator.wikimedia.org/T285251) (owner: Dzahn)
[08:20:32] (PS2) Dzahn: planet: replace http_proxy with https_proxy and add it to the update command [puppet] - https://gerrit.wikimedia.org/r/720234 (https://phabricator.wikimedia.org/T285251)
[08:25:18] (CR) Dzahn: [C: +2] planet: replace http_proxy with https_proxy and add it to the update command [puppet] - https://gerrit.wikimedia.org/r/720234 (https://phabricator.wikimedia.org/T285251) (owner: Dzahn)
[08:29:48] (PS1) Dzahn: planet: fix updatejob parameters, languages_keys isn't one [puppet] - https://gerrit.wikimedia.org/r/720236 (https://phabricator.wikimedia.org/T285251)
[08:30:35] (CR) Dzahn: [C: +2] planet: fix updatejob parameters, languages_keys isn't one [puppet] - https://gerrit.wikimedia.org/r/720236 (https://phabricator.wikimedia.org/T285251) (owner: Dzahn)
[08:32:25] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31049/console" [puppet] - https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: Milimetric)
[08:33:38] (CR) Elukey: [C: +1] analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: Milimetric)
[08:37:16] !log upgrade and restart db2139
[08:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:39] (PS1) Dzahn: planet: add HTTPS_PROXY with environment parameter, not directly to cmd [puppet] - https://gerrit.wikimedia.org/r/720239 (https://phabricator.wikimedia.org/T285251)
[08:40:30] (CR) Dzahn: [C: +2] planet: add HTTPS_PROXY with environment parameter, not directly to cmd [puppet] - https://gerrit.wikimedia.org/r/720239 (https://phabricator.wikimedia.org/T285251) (owner: Dzahn)
[08:43:47] mutante: a bunch of blog posts are showing up in my feed reader now ^.^
[08:44:13] thank you :))
[08:44:20] legoktm: :) thanks for confirming. yea, I found the fix but it's still from a manual run, I am fixing it in puppet now
[08:44:44] also using https_proxy for everything
[08:45:59] (PS1) Dzahn: planet: parameter 'environment' expects a Hash value, got String [puppet] - https://gerrit.wikimedia.org/r/720240 (https://phabricator.wikimedia.org/T285251)
[08:48:07] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/31050/planet1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/720240 (https://phabricator.wikimedia.org/T285251) (owner: Dzahn)
[08:55:41] (PS5) Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[09:00:25] (CR) Vgutierrez: [C: +2] envoyproxy: Support alpn_protocols configuration [puppet] - https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:01:49] (CR) Vgutierrez: [C: +2] envoyproxy: Support TLS min/max version config [puppet] - https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:02:04] (PS11) Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421)
[09:02:14] (PS6) Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[09:02:16] (PS9) Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[09:02:18] (PS7) Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[09:05:33] (PS1) Hashar: docker: add security updates to Bullseye base image [puppet] - https://gerrit.wikimedia.org/r/720241
[09:07:53] !log planet - deleted all state files for all languages, running fresh update via systemctl start for all languages after proxy changes (T285251)
[09:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:58] T285251: Wikimedia Planet not showing any external blog posts - https://phabricator.wikimedia.org/T285251
[09:08:27] (CR) Hashar: "I have spotted that while looking at a CI image installing nodejs 12. It got 12.21 from bullseye/main instead of 12.22.5 from bullseye-sec" [puppet] - https://gerrit.wikimedia.org/r/720241 (owner: Hashar)
[09:14:19] (CR) Filippo Giunchedi: o11y: add logstash alerts (1 comment) [alerts] - https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: Cwhite)
[09:14:26] (CR) Elukey: "Thanks to the diff in https://integration.wikimedia.org/ci/job/helm-lint/5287/console I realized that this change will also deploy tiller " [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:15:56] (CR) JMeybohm: [C: -1] kubeflow-kfserving-inference: avoid repetitions with multi-models (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:16:29] here it comes the stream of -1s :D
[09:16:40] * jayme hides
[09:16:43] lol
[09:18:55] thanks for the review, will check :)
[09:19:08] very small things, really :-)
[09:19:39] I don't know how relevant it is but you could end up with duplicates in the environment variables specified
[09:20:03] but as the more custom ones come last, that might "just work"
[09:20:56] I am learning a lot, really nice improvements, I'll add them all
[09:21:07] didn't know about concat and with, definitely way better
[09:22:01] this is from the sprig extension helm uses in addition to go's default template functions http://masterminds.github.io/sprig/lists.html
[09:22:40] better to ignore the host part of that URL, though :-D
[09:22:54] (PS1) Filippo Giunchedi: alerts: copy 'stat' for alert rules on deploy [puppet] - https://gerrit.wikimedia.org/r/720243
[09:23:46] (PS2) DCausse: alertmanager: set search-platform team [puppet] - https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467)
[09:24:31] (CR) DCausse: alertmanager: set search-platform team (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719931 (https://phabricator.wikimedia.org/T276467) (owner: DCausse)
[09:24:56] (PS1) Arturo Borrero Gonzalez: package_builder: drop transition packages [puppet] - https://gerrit.wikimedia.org/r/720244
[09:26:13] (PS4) Dzahn: thumbor: convert generate-thumbor-age-metrics to timer [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673)
[09:26:50] (PS1) DCausse: Revert "[wdqs] switch updater reporting topic to codfw" [puppet] - https://gerrit.wikimedia.org/r/720251
[09:26:54] (CR) Arturo Borrero Gonzalez: "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1001/31051/" [puppet] - https://gerrit.wikimedia.org/r/720244 (owner: Arturo Borrero Gonzalez)
[09:27:50] (CR) Dzahn: thumbor: convert generate-thumbor-age-metrics to timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[09:27:53] (PS7) Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791)
[09:27:55] (PS10) Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791)
[09:27:57] (PS8) Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791)
[09:28:50] (CR) DCausse: "To be merged just after eventgate switches to eqiad." [puppet] - https://gerrit.wikimedia.org/r/720251 (owner: DCausse)
[09:28:55] (CR) Elukey: kubeflow-kfserving-inference: avoid repetitions with multi-models (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:29:21] all suggestions folded in, tested also with helm template
[09:29:24] way better now
[09:29:51] (CR) Filippo Giunchedi: wip: logagent: sketch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/720110 (owner: Herron)
[09:30:23] (CR) JMeybohm: [C: -1] Add revscoring-editquality as first ml-service to helmfile.d (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:30:55] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/31052/thumbor1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[09:31:01] !log push pfw policies - T290611
[09:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:36] elukey: regarding the tiller thing you just discovered: Let's ping jelto to see if he maybe already has plans to implement a tiller toggle or something
[09:31:58] (CR) Dzahn: "compiles but there are some more "Resources only in the new catalog" than expected?" [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[09:32:43] SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (Legoktm) If it's not too much trouble, it would be nice if cumin2001 could have a MOTD pointing you to cumin2002. If you accidentally log into cumin2001 you'll end up trying to run cookbooks t...
[09:32:49] jayme: already done in #serviceops :)
[09:33:06] ah, great :)
[09:33:16] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (ayounsi)
[09:33:55] (PS1) Arturo Borrero Gonzalez: toolforge: package_builder: don't assume we have 2 disks starting with bullseye [puppet] - https://gerrit.wikimedia.org/r/720245
[09:35:02] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/719543 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[09:35:50] (PS1) Volans: quotereviewer: fix handling of pages without SKUs [software] - https://gerrit.wikimedia.org/r/720266 (https://phabricator.wikimedia.org/T288354)
[09:37:07] (CR) JMeybohm: [C: +1] kubeflow-kfserving-inference: avoid repetitions with multi-models (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:38:41] (CR) JMeybohm: [C: +1] Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) (owner: Elukey)
[09:43:41] SRE, Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (MoritzMuehlenhoff) >>! In T276589#7344145, @Legoktm wrote: > If it's not too much trouble, it would be nice if cumin2001 could have a MOTD pointing you to cumin2002. If you accidentally log in...
[09:50:43] (03CR) 10Effie Mouzeli: [C: 03+1] Add a timeout parameter [software/benchmw] - 10https://gerrit.wikimedia.org/r/716371 (owner: 10Alexandros Kosiaris) [09:50:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: package_builder: don't assume we have 2 disks starting with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/720245 (owner: 10Arturo Borrero Gonzalez) [09:50:50] (03CR) 10Effie Mouzeli: [C: 03+1] Using full names instead of shorthands [software/benchmw] - 10https://gerrit.wikimedia.org/r/719103 (owner: 10Alexandros Kosiaris) [09:50:58] (03CR) 10Effie Mouzeli: [C: 03+1] Fix title of load test [software/benchmw] - 10https://gerrit.wikimedia.org/r/719104 (owner: 10Alexandros Kosiaris) [09:55:33] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving-inference: avoid repetitions with multi-models [deployment-charts] - 10https://gerrit.wikimedia.org/r/719515 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:02:18] (03CR) 10Muehlenhoff: package_builder: drop transition packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720244 (owner: 10Arturo Borrero Gonzalez) [10:04:24] (03CR) 10Arturo Borrero Gonzalez: package_builder: drop transition packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720244 (owner: 10Arturo Borrero Gonzalez) [10:05:13] (03PS2) 10Arturo Borrero Gonzalez: package_builder: drop transition packages [puppet] - 10https://gerrit.wikimedia.org/r/720244 [10:07:40] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Kormat) >>! In T276589#7344207, @MoritzMuehlenhoff wrote: > We'll ditch cumin2001 very soon, it was only kept around for DBA purposes during the switchover window. That is (surprising) news t... 
[10:14:25] (03PS11) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [10:14:27] (03PS9) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [10:14:36] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:16:05] (03CR) 10Jbond: profile: restart postgres on first install / bootstrap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/705704 (owner: 10Filippo Giunchedi) [10:16:58] (03PS12) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [10:17:00] (03PS10) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [10:17:34] 10SRE, 10Patch-For-Review: migrate services from cumin2001 to cumin2002 - https://phabricator.wikimedia.org/T276589 (10Marostegui) >>! In T276589#7344207, @MoritzMuehlenhoff wrote: >>>! In T276589#7344145, @Legoktm wrote: >> If it's not too much trouble, it would be nice if cumin2001 could have a MOTD pointing... 
[10:17:49] (03Abandoned) 10Jbond: realm.pp: only check numa fact if it exists [puppet] - 10https://gerrit.wikimedia.org/r/720060 (owner: 10Jbond) [10:21:53] (03CR) 10Muehlenhoff: package_builder: drop transition packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720244 (owner: 10Arturo Borrero Gonzalez) [10:22:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/720076 (owner: 10Volans) [10:23:15] (03CR) 10JMeybohm: [C: 04-1] Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:23:42] sorry, that slipped me on first review [10:28:25] (03PS5) 10Kormat: mariadb: Page for read-only status issues in both DCs [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) [10:31:40] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:45] (03CR) 10Kormat: [C: 04-2] mariadb: Page for read-only status issues in both DCs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/719948 (https://phabricator.wikimedia.org/T290591) (owner: 10Kormat) [10:45:27] (03PS1) 10Vgutierrez: haproxy: Allow adding/removing HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/720272 (https://phabricator.wikimedia.org/T290005) [10:45:29] (03PS1) 10Vgutierrez: haproxy: Allow loading lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/720273 (https://phabricator.wikimedia.org/T290005) [10:45:31] (03PS1) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [10:49:59] (03PS3) 10Arturo Borrero Gonzalez: package_builder: drop transitional packages [puppet] - 10https://gerrit.wikimedia.org/r/720244 
[10:50:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/720244 (owner: 10Arturo Borrero Gonzalez) [10:51:21] 10SRE, 10ops-codfw: mw2280 unresponsive to powercycle and hardreset - https://phabricator.wikimedia.org/T290708 (10Dzahn) I was able to ssh to mgmt and I got this from `racadm getsel`: ` cmdstat status : 2 status_tag : COMMAND PROCESSING FAILED error : 253 error_tag : COMMAND NOT RECOGN... [10:57:15] 10SRE, 10Traffic, 10vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) Hi @ssingh In T290599 for durum1002 you asked for only 4GB RAM but here it is 8GB RAM. It seems either we need to recreate durum1002 with more RAM or all of them only need 4? [10:58:06] RECOVERY - Check systemd state on elastic1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:31] 10SRE, 10Traffic, 10vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10ssingh) Hi @Dzahn: Thank you for catching this mistake and confirming before starting the process! I have updated the ticket and we do indeed need 4 GB per VM. 
[11:04:40] 10SRE, 10Traffic, 10vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10ssingh) [11:09:18] (03PS1) 10Muehlenhoff: Temporarily filter port 25 on mx2001 for reimage [homer/public] - 10https://gerrit.wikimedia.org/r/720277 (https://phabricator.wikimedia.org/T286911) [11:27:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] package_builder: drop transitional packages [puppet] - 10https://gerrit.wikimedia.org/r/720244 (owner: 10Arturo Borrero Gonzalez) [11:33:12] 10SRE, 10Infrastructure-Foundations, 10Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10MoritzMuehlenhoff) [11:36:53] 10SRE, 10Infrastructure-Foundations, 10Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10MoritzMuehlenhoff) >>! In T283165#7103576, @Vgutierrez wrote: > As mentioned on the issue description, debian backported the fix for OpenSSL as it ca... 
[11:37:14] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:39:08] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [11:42:56] (03PS1) 10Arturo Borrero Gonzalez: toolforge: wmcs-package-build.py: use sudo in the sbuild call [puppet] - 10https://gerrit.wikimedia.org/r/720284 (https://phabricator.wikimedia.org/T273942) [11:49:59] 10SRE, 10Traffic, 10vm-requests: Please create Ganeti VMs for durum - https://phabricator.wikimedia.org/T290672 (10Dzahn) a:03Dzahn [11:53:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: wmcs-package-build.py: use sudo in the sbuild call [puppet] - 10https://gerrit.wikimedia.org/r/720284 (https://phabricator.wikimedia.org/T273942) (owner: 10Arturo Borrero Gonzalez) [11:57:27] (03PS1) 10Arturo Borrero Gonzalez: toolforge: wmcs-package-build: use newer build server as default [puppet] - 10https://gerrit.wikimedia.org/r/720289 [11:58:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: wmcs-package-build: use newer build server as default [puppet] - 10https://gerrit.wikimedia.org/r/720289 (owner: 10Arturo Borrero Gonzalez) [12:04:31] 10Puppet, 10Infrastructure-Foundations: Temporary failures for prometheus_puppet_agent_stats - https://phabricator.wikimedia.org/T290726 (10fgiunchedi) [12:21:19] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10CAS-SSO, and 3 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10fgiunchedi) Is this still an issue @RLazarus ? 
I can't reproduce it anymore [12:23:36] (03CR) 10Kormat: [C: 04-1] sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [12:24:22] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (10fgiunchedi) This should be a prometheus-native alert in `alerts.git` nowadays [12:32:38] (03PS13) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) [12:32:40] (03PS11) 10Elukey: Rakefile: change HELMFILE_GLOB to include ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/719522 (https://phabricator.wikimedia.org/T286791) [12:33:01] (03CR) 10Elukey: Add revscoring-editquality as first ml-service to helmfile.d (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/719128 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [12:40:29] (03PS5) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [12:42:02] (03CR) 10jerkins-bot: [V: 04-1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [12:43:40] (03Abandoned) 10Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - 10https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [12:44:05] (03Abandoned) 10Filippo Giunchedi: prometheus: let group 'prometheus' own metrics directory [puppet] - 10https://gerrit.wikimedia.org/r/517073 (owner: 10Filippo Giunchedi) [12:44:45] (03Abandoned) 10Filippo Giunchedi: cacheproxy: default tx ring [puppet] - 
10https://gerrit.wikimedia.org/r/563976 (owner: 10Filippo Giunchedi) [13:05:25] (03PS6) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [13:14:31] 10Puppet, 10Infrastructure-Foundations: Temporary failures for prometheus_puppet_agent_stats - https://phabricator.wikimedia.org/T290726 (10jbond) I wonder if we should just drop the git_sha from here and concentrate on getting that data into logstash? [13:22:29] (03PS1) 10Kormat: debian: Upstream release 3.2.4 [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720303 [13:22:51] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) Coming back to this task since Janis has a patch that needs to be added on top of `8.1901.0-1+wmf1` :) At the time I recall that I just picked buster's upstream version for rsyslog `8.... [13:26:54] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10MoritzMuehlenhoff) >>! In T277739#7344522, @elukey wrote: > Bullseye is out and there is not `rsyslog-kubernetes` in it, maybe we could start working with upstream to have it in unstable first... 
[13:33:40] (03CR) 10MVernon: [C: 03+1] "LGTM :)" [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720303 (owner: 10Kormat) [13:35:36] (03CR) 10Kormat: [V: 03+2 C: 03+2] debian: Upstream release 3.2.4 [debs/orchestrator] - 10https://gerrit.wikimedia.org/r/720303 (owner: 10Kormat) [13:39:21] (03PS1) 10Effie Mouzeli: mwdebug: bump envoy CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/720313 [13:46:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] mwdebug: bump envoy CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/720313 (owner: 10Effie Mouzeli) [13:47:01] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: bump envoy CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/720313 (owner: 10Effie Mouzeli) [13:51:25] (03Merged) 10jenkins-bot: mwdebug: bump envoy CPU [deployment-charts] - 10https://gerrit.wikimedia.org/r/720313 (owner: 10Effie Mouzeli) [13:52:25] (03PS6) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [13:53:21] (03CR) 10jerkins-bot: [V: 04-1] Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [13:54:55] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:15] (03PS1) 10Jelto: gitlab::backup make config backup less verbose [puppet] - 10https://gerrit.wikimedia.org/r/720316 (https://phabricator.wikimedia.org/T288324) [13:59:21] (03CR) 10Jbond: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:02:25] (03PS7) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [14:02:54] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:05:10] (03PS6) 10Herron: wip: logagent: puppet module sketch [puppet] - 10https://gerrit.wikimedia.org/r/720110 [14:05:19] (03CR) 10Jbond: "lgtm minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:06:43] (03CR) 10jerkins-bot: [V: 04-1] wip: logagent: puppet module sketch [puppet] - 10https://gerrit.wikimedia.org/r/720110 (owner: 10Herron) [14:08:06] (03PS2) 10Vgutierrez: haproxy: Allow adding/removing HTTP headers [puppet] - 10https://gerrit.wikimedia.org/r/720272 (https://phabricator.wikimedia.org/T290005) [14:08:08] (03PS2) 10Vgutierrez: haproxy: Allow loading lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/720273 (https://phabricator.wikimedia.org/T290005) [14:08:10] (03PS2) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [14:09:03] (03PS1) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary per T289752 and T289767 respectively Changes made: modified: InitialiseSettings.php Bug:T289752 Bug:T289767 Change-Id: 
I0868347ac76f7c97dbde3ee3c87a36e8c460a4bb [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 [14:12:22] (03PS3) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [14:14:50] (03PS7) 10Herron: wip: logagent: puppet module sketch [puppet] - 10https://gerrit.wikimedia.org/r/720110 (https://phabricator.wikimedia.org/T288620) [14:16:46] (03PS8) 10Btullis: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) [14:17:39] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:21:52] (03PS1) 10Effie Mouzeli: mwdebug: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/720324 [14:22:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:22:06] (03CR) 10Btullis: Improve the Kerberos automatic renewal service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [14:24:26] (03PS2) 10Volans: sre.experimental.reimage: refactor PuppetDB update [cookbooks] - 10https://gerrit.wikimedia.org/r/720076 [14:24:32] (03PS3) 10Volans: sre.experimental.reimage: refactor PuppetDB update [cookbooks] - 10https://gerrit.wikimedia.org/r/720076 [14:30:14] (03PS4) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [14:31:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install 
puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (10Papaul) [14:32:04] 10Puppet, 10Infrastructure-Foundations: Temporary failures for prometheus_puppet_agent_stats - https://phabricator.wikimedia.org/T290726 (10fgiunchedi) I tend to agree, since we have a path forward now with logstash + puppet reports might as well back out of the git_sha in prometheus metrics (and eliminate the... [14:34:34] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10JMeybohm) I'd say we do 2. for short term (easier to do, less nodes to update, more in line with what we currently have) plus 3. in form of trying to get a sponsored upload to debian upstream t... [14:36:16] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/720324 (owner: 10Effie Mouzeli) [14:39:06] 10SRE, 10Infrastructure-Foundations, 10Traffic: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain - https://phabricator.wikimedia.org/T283165 (10akosiaris) p:05Triage→03Medium [14:39:47] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: refactor PuppetDB update [cookbooks] - 10https://gerrit.wikimedia.org/r/720076 (owner: 10Volans) [14:40:59] (03PS1) 10Alexandros Kosiaris: Add saisuman to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/720329 (https://phabricator.wikimedia.org/T290661) [14:41:20] (03Merged) 10jenkins-bot: mwdebug: bump cpu [deployment-charts] - 10https://gerrit.wikimedia.org/r/720324 (owner: 10Effie Mouzeli) [14:42:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add saisuman to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/720329 (https://phabricator.wikimedia.org/T290661) (owner: 10Alexandros Kosiaris) [14:43:51] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:07] (03Merged) 10jenkins-bot: sre.experimental.reimage: refactor PuppetDB update [cookbooks] - 10https://gerrit.wikimedia.org/r/720076 (owner: 10Volans) [14:45:35] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for scherukuwada - https://phabricator.wikimedia.org/T290661 (10akosiaris) 05Open→03Resolved a:03akosiaris Hello @SCherukuwada, I've added you to the wmf ldap group. You should now have access to all the basic privileged servi... [14:48:27] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:32] (03PS2) 10Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720320 [14:51:07] (03CR) 10Herron: "ready to move forward, or anything else to adjust?" 
[grafana-grizzly] - 10https://gerrit.wikimedia.org/r/717587 (https://phabricator.wikimedia.org/T289036) (owner: 10Herron) [14:54:07] (03PS1) 10Alexandros Kosiaris: admin: Move abban to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/720330 (https://phabricator.wikimedia.org/T289775) [14:55:38] (03CR) 10Vgutierrez: "From varnishlog -n frontend on our cloud environment:" [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:57:44] (03PS2) 10Alexandros Kosiaris: admin: Move abban to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/720330 (https://phabricator.wikimedia.org/T289775) [14:59:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Move abban to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/720330 (https://phabricator.wikimedia.org/T289775) (owner: 10Alexandros Kosiaris) [15:00:02] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [15:00:09] (03PS1) 10Papaul: Add centrallog2002 to site.pp, dhcp file and netboot [puppet] - 10https://gerrit.wikimedia.org/r/720331 (https://phabricator.wikimedia.org/T289624) [15:00:45] (03PS1) 10Volans: sre.experimental.reimage: resolve PuppetDB FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/720332 [15:01:19] 10SRE, 10SRE-Access-Requests, 10Analytics, 10Patch-For-Review: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10akosiaris) 05Open→03Resolved Hi @AbbanWMDE, Change has been merged now that it has been approved. It will take ~30mins to... 
[15:01:52] (03CR) 10Papaul: [C: 03+2] Add centrallog2002 to site.pp, dhcp file and netboot [puppet] - 10https://gerrit.wikimedia.org/r/720331 (https://phabricator.wikimedia.org/T289624) (owner: 10Papaul) [15:04:10] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10elukey) New plan that was discussed between me and Janis on IRC: 1) In our rsyslog repo, `git pull upstream debian/master` to get the last updates. 2) Create a new branch in our `rsyslog` repo... [15:04:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/720332 (owner: 10Volans) [15:07:38] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: resolve PuppetDB FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/720332 (owner: 10Volans) [15:10:07] 10SRE, 10observability: rsyslog-kubernetes missing in buster-wikimedia - https://phabricator.wikimedia.org/T277739 (10JMeybohm) >>! In T277739#7344860, @elukey wrote: > New plan that was discussed between me and Janis on IRC: > > 1) In our rsyslog repo, `git pull upstream debian/master` to get the last update... 
[15:11:04] (03Merged) 10jenkins-bot: sre.experimental.reimage: resolve PuppetDB FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/720332 (owner: 10Volans) [15:13:25] (03PS3) 10Cwhite: o11y: add logstash alerts [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) [15:13:59] (03CR) 10Cwhite: o11y: add logstash alerts (0311 comments) [alerts] - 10https://gerrit.wikimedia.org/r/720079 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [15:19:57] (03PS10) 10Elukey: kubernetes: add revscoring-editquality in the services configs [puppet] - 10https://gerrit.wikimedia.org/r/720048 (https://phabricator.wikimedia.org/T286791) [15:25:29] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add a timeout parameter [software/benchmw] - 10https://gerrit.wikimedia.org/r/716371 (owner: 10Alexandros Kosiaris) [15:25:35] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Thanks!" [software/benchmw] - 10https://gerrit.wikimedia.org/r/716371 (owner: 10Alexandros Kosiaris) [15:26:17] (03PS3) 10Cwhite: o11y: add rsyslog alerts [alerts] - 10https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) [15:26:29] (03CR) 10Cwhite: o11y: add rsyslog alerts (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/720063 (https://phabricator.wikimedia.org/T288726) (owner: 10Cwhite) [15:27:54] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [15:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:34] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['centrallog2002.codfw.wmnet'] ` Of which those **FA... 
[15:39:03] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1001.eqiad.wmnet [15:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:03] (03PS9) 10Jbond: Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [15:44:31] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [15:44:36] (03CR) 10Jbond: [C: 03+1] Improve the Kerberos automatic renewal service [puppet] - 10https://gerrit.wikimedia.org/r/711482 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [15:55:32] (03PS1) 10Volans: sre.experimental.reimage: fix Puppet noop run [cookbooks] - 10https://gerrit.wikimedia.org/r/720340 [15:58:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/720340 (owner: 10Volans) [15:59:04] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: fix Puppet noop run [cookbooks] - 10https://gerrit.wikimedia.org/r/720340 (owner: 10Volans) [16:01:34] (03Merged) 10jenkins-bot: sre.experimental.reimage: fix Puppet noop run [cookbooks] - 10https://gerrit.wikimedia.org/r/720340 (owner: 10Volans) [16:02:49] (03PS1) 10Papaul: Add puppetmaster200[45] to site.pp, Dhcp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/720341 (https://phabricator.wikimedia.org/T289733) [16:03:23] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [16:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:06] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:12:26] (03PS2) 10Papaul: Add 
puppetmaster200[45] to site.pp, Dhcp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/720341 (https://phabricator.wikimedia.org/T289733) [16:12:41] (03PS1) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) [16:14:21] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1001.eqiad.wmnet [16:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:59] (03CR) 10Papaul: [C: 03+2] Add puppetmaster200[45] to site.pp, Dhcp and netboot [puppet] - 10https://gerrit.wikimedia.org/r/720341 (https://phabricator.wikimedia.org/T289733) (owner: 10Papaul) [16:24:34] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` pup... 
[16:26:09] (03PS2) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) [16:31:26] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) @MoritzMuehlenhoff Bullseye installer giving me "Failed to load ldlinux.c32" [16:34:17] (03PS3) 10Jelto: helmfile.d/admin make tiller components configurable per environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) [16:40:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: REIMAGE [16:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:01] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on puppetmaster2004.codfw.wmnet with reason: REIMAGE [16:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:17] (03CR) 10Jelto: "As discussed in IRC we want to be able to enable or disable tiller components per environment. This is needed for helm3 migration as well " [deployment-charts] - 10https://gerrit.wikimedia.org/r/720342 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [16:50:55] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster2004.codfw.wmnet'] ` and were **ALL**... 
[16:54:31] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [16:58:47] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` pup... [17:00:15] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [17:01:05] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [17:01:22] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [17:01:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [17:02:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) Has Thumbor been upgraded, or is this waiting on {T216815}? 
[17:14:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster2005.codfw.wmnet with reason: REIMAGE
[17:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on puppetmaster2005.codfw.wmnet with reason: REIMAGE
[17:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:36] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster2005.codfw.wmnet'] ` and were **ALL**...
[17:28:58] (PS1) Jforrester: Undeploy VipsScaler: I – Disable on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/720355 (https://phabricator.wikimedia.org/T290759)
[17:29:02] (PS1) Jforrester: Undeploy VipsScaler: II – Don't load regardless of config [mediawiki-config] - https://gerrit.wikimedia.org/r/720356 (https://phabricator.wikimedia.org/T290759)
[17:29:04] (PS1) Jforrester: Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored [mediawiki-config] - https://gerrit.wikimedia.org/r/720357 (https://phabricator.wikimedia.org/T290759)
[17:29:06] (PS1) Jforrester: Undeploy VipsScaler: IV – Don't load the i18n [mediawiki-config] - https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759)
[17:29:17] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (Papaul)
[17:30:16] SRE, ops-codfw, DC-Ops, Infrastructure-Foundations: Q1: (Need By: TBD) rack/setup/install puppetmaster200[45].codfw.wmnet - https://phabricator.wikimedia.org/T289733 (Papaul) Open→Resolved This is complete
[17:56:43] (PS1) Volans: sre.experimental.reimage: fix query for downtime [cookbooks] - https://gerrit.wikimedia.org/r/720361
[18:01:12] (PS1) Jforrester: Add new config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720362
[18:01:14] (PS1) Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363
[18:01:55] (PS2) Jforrester: Add new config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932)
[18:02:04] (CR) jerkins-bot: [V: -1] Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: Jforrester)
[18:02:06] (PS2) Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932)
[18:03:23] (CR) jerkins-bot: [V: -1] Add new config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932) (owner: Jforrester)
[18:03:27] (CR) jerkins-bot: [V: -1] Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932) (owner: Jforrester)
[18:03:59] (CR) Volans: [C: +2] "Trivial fix, self-merging" [cookbooks] - https://gerrit.wikimedia.org/r/720361 (owner: Volans)
[18:06:58] (Merged) jenkins-bot: sre.experimental.reimage: fix query for downtime [cookbooks] - https://gerrit.wikimedia.org/r/720361 (owner: Volans)
[18:07:25] (PS3) Jforrester: Add new config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720362 (https://phabricator.wikimedia.org/T277932)
[18:07:27] (PS3) Jforrester: Drop old config names for CentralAuth denylist controls [mediawiki-config] - https://gerrit.wikimedia.org/r/720363 (https://phabricator.wikimedia.org/T277932)
[18:08:32] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet
[18:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:21] (PS1) Ssingh: test_dns: update test_durum() to test the web application [software/knead-wikidough] - https://gerrit.wikimedia.org/r/720368 (https://phabricator.wikimedia.org/T289536)
[18:22:51] (CR) Ssingh: [C: +2] test_dns: update test_durum() to test the web application [software/knead-wikidough] - https://gerrit.wikimedia.org/r/720368 (https://phabricator.wikimedia.org/T289536) (owner: Ssingh)
[18:24:28] (CR) Legoktm: [C: +1] Undeploy VipsScaler: I – Disable on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/720355 (https://phabricator.wikimedia.org/T290759) (owner: Jforrester)
[18:24:35] (CR) Legoktm: [C: +1] Undeploy VipsScaler: II – Don't load regardless of config [mediawiki-config] - https://gerrit.wikimedia.org/r/720356 (https://phabricator.wikimedia.org/T290759) (owner: Jforrester)
[18:24:43] (CR) Legoktm: [C: +1] Undeploy VipsScaler: III – Don't set wmgUseVips, now ignored [mediawiki-config] - https://gerrit.wikimedia.org/r/720357 (https://phabricator.wikimedia.org/T290759) (owner: Jforrester)
[18:25:01] (CR) Legoktm: [C: +1] "I think this could be merged with step II, but it doesn't matter that much." [mediawiki-config] - https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759) (owner: Jforrester)
[18:26:06] (CR) Jforrester: Undeploy VipsScaler: IV – Don't load the i18n (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/720358 (https://phabricator.wikimedia.org/T290759) (owner: Jforrester)
[18:26:40] (PS1) Ssingh: test_dns: remove redundant whitespace [software/knead-wikidough] - https://gerrit.wikimedia.org/r/720370
[18:27:37] (CR) Ssingh: [C: +2] test_dns: remove redundant whitespace [software/knead-wikidough] - https://gerrit.wikimedia.org/r/720370 (owner: Ssingh)
[18:33:44] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (Papaul) >>! In T289624#7345155, @ops-monitoring-bot wrote: > Completed auto-reimage of hosts: > ` > ['puppet...
[18:34:11] !log volans@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1001.eqiad.wmnet
[18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:23] SRE, ops-codfw, DC-Ops, observability, SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (Papaul)
[18:35:52] (PS1) Volans: sre.experimental.reimage: fix typo in variable [cookbooks] - https://gerrit.wikimedia.org/r/720371
[18:41:12] (CR) Legoktm: sre.switchdc.mediawiki: Downtime read-only checks on the DB primaries (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/718936 (https://phabricator.wikimedia.org/T285803) (owner: RLazarus)
[18:42:42] SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (MRaishWMF)
[18:45:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:46:03] (PS1) Ladsgroup: mailman: Remove mailman2 config file [puppet] - https://gerrit.wikimedia.org/r/720374 (https://phabricator.wikimedia.org/T282303)
[18:47:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:56:48] (CR) RobH: [C: +2] quotereviewer: fix handling of pages without SKUs [software] - https://gerrit.wikimedia.org/r/720266 (https://phabricator.wikimedia.org/T288354) (owner: Volans)
[19:07:39] (PS1) Legoktm: helmfile.d: Add shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/720378 (https://phabricator.wikimedia.org/T289227)
[19:31:18] (PS3) Krinkle: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) (owner: Rishabhbhat)
[19:33:16] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/720384
[19:49:30] (PS1) Krinkle: Remove $wmgLogstashServers (step 1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/720385
[19:49:32] (PS1) Krinkle: Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/720386
[19:52:11] (PS1) Krinkle: Early adopt wgIncludejQueryMigrate=false on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/720387 (https://phabricator.wikimedia.org/T280944)
[20:04:05] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy) - https://phabricator.wikimedia.org/T290766 (AChang_WMF) Approved by manager.
[20:27:58] (PS3) RobH: removed sku 403-BCLL by mistake [software] - https://gerrit.wikimedia.org/r/716042
[20:28:00] (PS1) RobH: sku updates for configs a-e [software] - https://gerrit.wikimedia.org/r/720390
[20:28:32] (CR) RobH: [C: +2] removed sku 403-BCLL by mistake [software] - https://gerrit.wikimedia.org/r/716042 (owner: RobH)
[20:28:40] (CR) RobH: [C: +2] sku updates for configs a-e [software] - https://gerrit.wikimedia.org/r/720390 (owner: RobH)
[20:29:03] (Merged) jenkins-bot: removed sku 403-BCLL by mistake [software] - https://gerrit.wikimedia.org/r/716042 (owner: RobH)
[20:29:07] (Merged) jenkins-bot: sku updates for configs a-e [software] - https://gerrit.wikimedia.org/r/720390 (owner: RobH)
[20:36:26] (CR) Jeena Huneidi: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/720384 (owner: PipelineBot)
[20:40:25] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/720384 (owner: PipelineBot)
[20:42:29] !log jhuneidi@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[20:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:50] !log jhuneidi@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[20:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:24] !log jhuneidi@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[20:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:31] (CR) Legoktm: [C: +2] helmfile.d: Add shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/720378 (https://phabricator.wikimedia.org/T289227) (owner: Legoktm)
[21:16:31] (Merged) jenkins-bot: helmfile.d: Add shellbox-syntaxhighlight [deployment-charts] - https://gerrit.wikimedia.org/r/720378 (https://phabricator.wikimedia.org/T289227) (owner: Legoktm)
[21:18:39] PROBLEM - Check systemd state on idp-test2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:21:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:21:25] !log legoktm@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[21:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:27:49] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[21:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:58] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[21:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:10] (PS1) Ebernhardson: Include shellcheck on ci slave instances [puppet] - https://gerrit.wikimedia.org/r/720402
[21:45:23] RECOVERY - Check systemd state on idp-test2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:01] PROBLEM - Disk space on maps2006 is CRITICAL: DISK CRITICAL - free space: / 2621 MB (3% inode=98%): /tmp 2621 MB (3% inode=98%): /var/tmp 2621 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps2006&var-datasource=codfw+prometheus/ops
[22:53:49] PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:20:41] RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:45] (PS31) Jforrester: profile::ci::slave::labs::common: move to cinder-based storage [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[23:49:59] (CR) Jforrester: "PS31: This is just a manual rebase so I can pull this into the puppetmaster and find out if it works for the new integration-agent-docker-" [puppet] - https://gerrit.wikimedia.org/r/670524 (https://phabricator.wikimedia.org/T277078) (owner: Andrew Bogott)
[23:59:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down