[00:08:05] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:29:17] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:31] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:11] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:30:27] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:51:44] (03CR) 10Dduvall: "This change is ready for review." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [02:52:32] (03CR) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:06:19] (03PS1) 10RLazarus: admin: Add dmantena to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/791480 (https://phabricator.wikimedia.org/T308294) [03:10:54] (03CR) 10RLazarus: [C: 03+2] admin: Add dmantena to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/791480 (https://phabricator.wikimedia.org/T308294) (owner: 10RLazarus) [03:11:47] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:29:17] (03PS5) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) [03:32:32] (03PS6) 10Dduvall: buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) [03:43:09] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:51:53] (NodeTextfileStale) firing: 
Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:58:37] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:00:43] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:34:29] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:54:17] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:03:16] 10SRE, 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 (10Marostegui) Thanks Papaul, that error is strange indeed. I have double checked the raid status and also all the controller logs and they look fine (and I can see the firmware upgrade there too) - I have also found... 
[05:14:07] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:16:39] (03PS1) 10KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828) [05:19:21] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:21:04] (03PS1) 10Marostegui: Revert "db2109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/791486 [05:21:53] (03CR) 10Marostegui: [C: 03+2] Revert "db2109: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/791486 (owner: 10Marostegui) [05:45:27] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:22] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:23:24] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:32:50] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:34:42] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:54:14] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:54:40] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220513T0700) [07:01:49] (03PS7) 10Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:03:45] (03PS8) 10Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) [07:06:13] (03CR) 10Sergio Gimeno: Account creation: update live campaigns config (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [07:18:26] !log start of mwscript extensions/Echo/maintenance/removeOrphanedEvents.php --wiki=wikidatawiki --force (T308084) [07:18:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:33] T308084: Reduce DB space used by Echo notifications - https://phabricator.wikimedia.org/T308084 [07:20:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4001.ulsfo.wmnet [07:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4001.ulsfo.wmnet [07:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:17] !log root@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4001.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [07:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:45] !log root@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4001.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [07:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:12] (03CR) 10Filippo Giunchedi: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [07:47:12] (03PS1) 10Ayounsi: drmrs: add Init7 transit [homer/public] - 10https://gerrit.wikimedia.org/r/791554 [07:50:12] (03CR) 10Ayounsi: [C: 03+2] drmrs: add Init7 transit [homer/public] - 10https://gerrit.wikimedia.org/r/791554 (owner: 10Ayounsi) [07:50:50] (03Merged) 10jenkins-bot: drmrs: add Init7 transit [homer/public] - 10https://gerrit.wikimedia.org/r/791554 (owner: 10Ayounsi) [07:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:52:49] !log add init7 transit in drmrs [07:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:39] 10SRE, 
10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) [07:53:54] (03PS1) 10Ayounsi: drmrs: add Init7 TE communities [homer/public] - 10https://gerrit.wikimedia.org/r/791555 [07:54:13] (03Abandoned) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [07:54:39] (03CR) 10Ayounsi: [C: 03+2] drmrs: add Init7 TE communities [homer/public] - 10https://gerrit.wikimedia.org/r/791555 (owner: 10Ayounsi) [07:55:09] (03Merged) 10jenkins-bot: drmrs: add Init7 TE communities [homer/public] - 10https://gerrit.wikimedia.org/r/791555 (owner: 10Ayounsi) [07:58:03] (03PS2) 10Filippo Giunchedi: wmflib: extend sites [puppet] - 10https://gerrit.wikimedia.org/r/791309 [07:58:05] (03PS2) 10Filippo Giunchedi: netops: add site/role to netops::check to cater for new data structure [puppet] - 10https://gerrit.wikimedia.org/r/791310 [07:58:07] (03PS3) 10Filippo Giunchedi: WIP move network routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [07:58:09] (03PS5) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [07:59:10] !log draining ganeti4002 T307997 [07:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:15] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 [08:03:35] !log moving s2 database from db2101 to db2097 T299920 [08:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:40] T299920: Rebalance db1102 backup source, which often causes alert spam due to network throughput - https://phabricator.wikimedia.org/T299920 [08:03:59] (03CR) 10JMeybohm: [C: 04-1] New service: image-suggestion (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 
(https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [08:04:54] 10SRE, 10SRE Observability: Reminders for unhandled/unacked alerts - https://phabricator.wikimedia.org/T307958 (10fgiunchedi) In terms of implementation I think we should extend icinga-exporter to have metrics/histograms about alerts in interesting states, from there we can alert on said metric(s) and notify a... [08:08:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (didn't test it tho)" [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [08:11:34] (03CR) 10Sergio Gimeno: Account creation: update live campaigns config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [08:11:52] (03PS1) 10Ayounsi: Revert "drmrs: add Init7 transit" [homer/public] - 10https://gerrit.wikimedia.org/r/791489 [08:11:59] (03PS1) 10Ayounsi: Revert "drmrs: add Init7 TE communities" [homer/public] - 10https://gerrit.wikimedia.org/r/791490 [08:12:18] (03PS38) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [08:15:12] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [08:16:54] (03CR) 10Muehlenhoff: Add an alias to target VMs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791391 (owner: 10Muehlenhoff) [08:17:02] (03CR) 10Muehlenhoff: [C: 03+2] Add an alias to target VMs [puppet] - 10https://gerrit.wikimedia.org/r/791391 (owner: 10Muehlenhoff) [08:35:32] (03PS1) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) [08:36:44] (03CR) 10JMeybohm: [C: 03+2] Add kubernetes 
admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 (owner: 10JMeybohm) [08:39:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [08:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:22] (03CR) 10Jbond: [C: 03+1] "LGTM thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/791429 (owner: 10Ssingh) [08:41:33] (03PS2) 10Jcrespo: alerting_host: Remove references to dbbackups monitoring [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) [08:43:44] (03CR) 10Ayounsi: wmflib: extend sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [08:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [08:43:58] (03CR) 10Jbond: [C: 03+1] "LGTM feel free to ignore the comment" [puppet] - 10https://gerrit.wikimedia.org/r/791327 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [08:44:53] (03PS1) 10Klausman: modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) [08:45:19] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10hashar) From the contint1001 /var/log/puppet.log* files, the last good run was: ` Apr 27 15:32:34 contint1001 puppe... 
[08:45:27] (03CR) 10jerkins-bot: [V: 04-1] modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [08:46:41] 10SRE-Access-Requests, 10Machine-Learning-Team: Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (10elukey) [08:46:52] (03CR) 10Jcrespo: [C: 04-1] "This is the final state wanted after spliting the alerting for dbbackups on a separate role. Leaving it for reference (the split is not ye" [puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [08:47:37] (03PS2) 10Klausman: modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) [08:48:30] (03PS3) 10Klausman: modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) [08:48:37] (03CR) 10Jcrespo: [C: 04-1] "BTW, this requires changes on the private repo, too." 
[puppet] - 10https://gerrit.wikimedia.org/r/791560 (https://phabricator.wikimedia.org/T283017) (owner: 10Jcrespo) [08:49:16] (03CR) 10Hashar: git: add define for abritrarily named config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791327 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [08:49:59] (03CR) 10Jbond: [C: 03+1] "change as is is also fine" [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [08:50:03] (03CR) 10Michael Große: [C: 03+1] puppet_alert: Improve message [puppet] - 10https://gerrit.wikimedia.org/r/791559 (owner: 10Lucas Werkmeister (WMDE)) [08:50:14] (03PS2) 10Jelto: gitlab: allow multiple passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) [08:50:50] (03CR) 10Jbond: [C: 03+1] netops: add site/role to netops::check to cater for new data structure [puppet] - 10https://gerrit.wikimedia.org/r/791310 (owner: 10Filippo Giunchedi) [08:51:34] (03CR) 10Hashar: Fix permissions/ownership of helm directories (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [08:52:34] (03PS13) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [08:53:17] (03CR) 10Jbond: [C: 03+1] "LGTM will leave the open questions to arzhel" [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [08:53:35] (03CR) 10Cathal Mooney: [C: 03+1] "Huh. I got the wrong impression on that too. OH well." 
[homer/public] - 10https://gerrit.wikimedia.org/r/791489 (owner: 10Ayounsi) [08:54:05] (03CR) 10Cathal Mooney: [C: 03+1] Revert "drmrs: add Init7 TE communities" [homer/public] - 10https://gerrit.wikimedia.org/r/791490 (owner: 10Ayounsi) [08:56:08] (03CR) 10Elukey: "There are a couple of places where we have the celery 5 conditionals:" [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [08:57:27] (03CR) 10Klausman: modules: clean up special case for celery v4 in ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [08:58:37] (03CR) 10Elukey: modules: clean up special case for celery v4 in ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [08:59:19] (03CR) 10Elukey: [C: 03+1] modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:01:00] (03CR) 10Jelto: [C: 03+2] gitlab: allow multiple passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:01:50] 10SRE-Access-Requests, 10Machine-Learning-Team: Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (10elukey) I may have created this task too soon, some discussion on T305729 is still happening, let's wait before proceeding. [09:01:54] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35243/console" [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:02:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [09:02:58] ACKNOWLEDGEMENT - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Working on this in T308267 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:03:13] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35244/console" [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:04:16] (03CR) 10Btullis: [V: 03+1 C: 03+2] Create new sudo rules to facilitate monitoring airflow [puppet] - 10https://gerrit.wikimedia.org/r/791366 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [09:04:19] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35245/console" [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:05:08] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35246/console" [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:05:31] (03CR) 10Klausman: [V: 03+1 C: 03+2] modules: clean up special case for celery v4 in ORES [puppet] - 10https://gerrit.wikimedia.org/r/791561 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [09:07:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [09:12:12] (03PS1) 10Jcrespo: dbbackups: Move s2 instance and 
backups from db2101 to db2097 [puppet] - 10https://gerrit.wikimedia.org/r/791564 (https://phabricator.wikimedia.org/T299920) [09:12:38] (03PS2) 10Jcrespo: dbbackups: Move s2 instance and backups from db2101 to db2097 [puppet] - 10https://gerrit.wikimedia.org/r/791564 (https://phabricator.wikimedia.org/T299920) [09:14:55] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup.timer,rsync-data-backup.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:14:59] (03PS3) 10Jcrespo: dbbackups: Move s2 instance and backups from db2101 to db2097 [puppet] - 10https://gerrit.wikimedia.org/r/791564 (https://phabricator.wikimedia.org/T299920) [09:15:41] ^ gitlab alert is related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/790699 I'll take a look [09:18:16] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move s2 instance and backups from db2101 to db2097 [puppet] - 10https://gerrit.wikimedia.org/r/791564 (https://phabricator.wikimedia.org/T299920) (owner: 10Jcrespo) [09:20:15] (03PS39) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [09:20:45] (03PS1) 10Jbond: C:helm: make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) [09:20:51] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [09:21:07] (03CR) 10Jbond: "FYI https://gerrit.wikimedia.org/r/c/operations/puppet/+/791565" [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [09:21:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35247/console" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [09:22:07] (03CR) 10Jbond: C:helm: make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [09:22:32] (03PS40) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [09:24:04] (03PS6) 10Jbond: P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) [09:24:12] (03CR) 10Jbond: [C: 03+2] P:ssh::server: add support for accept env [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [09:25:56] (03PS7) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [09:26:04] (03CR) 10Jbond: [C: 03+2] C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [09:26:21] RECOVERY - Check systemd state on gitlab1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:26] (03CR) 10Jbond: [C: 03+2] P:ssh::server: add support for accept env (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791387 (https://phabricator.wikimedia.org/T307565) (owner: 10Jbond) [09:27:57] (03CR) 10Jelto: [C: 03+2] "change looks good on gitlab1001. 
I tried manually running rsync using" [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:28:01] (03PS8) 10Jbond: C:ssh:client: Add ability to manage ssh_config file [puppet] - 10https://gerrit.wikimedia.org/r/791396 (https://phabricator.wikimedia.org/T307565) [09:35:55] (03CR) 10Ayounsi: WIP move network routers definitions to hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [09:51:08] (03PS1) 10Jbond: P:ssh::client: Add GSSAPIDelegateCredentials support to ssh::client [puppet] - 10https://gerrit.wikimedia.org/r/791567 [09:51:25] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10akosiaris) For what is worth, I think we 've peaked. In the 30day graph {F35137940} we can see the increase in traffic. However, it's so low in volume (prometheus counts 552 apns in the l... [09:52:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35248/console" [puppet] - 10https://gerrit.wikimedia.org/r/791567 (owner: 10Jbond) [09:58:45] (03CR) 10Vgutierrez: [C: 03+1] P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [10:02:42] PROBLEM - Check systemd state on acmechief1001 is CRITICAL: CRITICAL - degraded: The following units failed: acme-chief-certs-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:49] uh.. 
[10:02:51] * vgutierrez checking [10:03:34] (03PS41) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [10:03:44] May 13 10:00:00 acmechief1001 acme-chief-certs-sync[8089]: /etc/ssh/ssh_config: line 2: Bad configuration option: hosts [10:03:44] May 13 10:00:00 acmechief1001 acme-chief-certs-sync[8089]: /etc/ssh/ssh_config: terminating, 1 bad configuration options [10:03:44] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:03:49] ^^ is that you jbond? [10:05:23] asking cause (41c4b65190) Jbond - C:ssh:client: Add ability to manage ssh_config file' got merged [10:05:48] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: sre.hosts.reimage cookbook dosn't like different LC_ALL environments - https://phabricator.wikimedia.org/T307565 (10jbond) 05Open→03Resolved a:03jbond We have now disabled sending and accepting LANG and LC environment variabl... 
[10:07:37] (03PS1) 10Vgutierrez: C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) [10:07:47] (03CR) 10jerkins-bot: [V: 04-1] C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:08:49] (03PS1) 10Btullis: Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) [10:08:58] (03CR) 10jerkins-bot: [V: 04-1] Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [10:09:03] jbond, moritzm https://gerrit.wikimedia.org/r/c/operations/puppet/+/791569/ should fix it [10:10:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35249/console" [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [10:11:18] I'm wondering if jerkins-bot isn't able to checkout the git repo due to that SSH change [10:12:40] (03PS1) 10Jbond: O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) [10:12:46] (03PS1) 10Jbond: netbox-next: move netbox-next to caching infrastructre [dns] - 10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) [10:12:50] (03CR) 10jerkins-bot: [V: 04-1] O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:12:54] (03CR) 10jerkins-bot: [V: 04-1] netbox-next: move netbox-next to caching infrastructre [dns] - 10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) (owner: 
10Jbond) [10:13:03] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35250/console" [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:13:04] RECOVERY - Checks that the airflow database for airflow analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:13:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35251/console" [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:13:22] (03PS42) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [10:13:32] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:13:46] (03PS2) 10Btullis: Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) [10:13:55] (03CR) 10jerkins-bot: [V: 04-1] Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [10:15:28] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:17:16] 
(03CR) 10Muehlenhoff: [C: 03+1] "Doh! Thanks, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:18:12] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:18:36] hmm let me disable puppet on gerrit1001 and fix that manually [10:18:58] !log disable puppet on gerrit1001 to fix /etc/ssh/ssh_config [10:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:38] (03PS2) 10Jbond: C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:19:47] (03CR) 10jerkins-bot: [V: 04-1] C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:20:27] vgutierrez: Is this SSH patch why gerrit is giving me an unexpected merge error at the moment? [10:20:53] that was my theory... [10:21:12] but it looks like it isn't that straight-forward [10:21:17] hashar: are you able to help with a gerrit issue [10:21:28] sure always! 
[10:21:49] i sent out a patch that broke ssh everywhere and now gerrit can't pass ci [10:21:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/791569 [10:22:19] funny enough puppet didn't reload gerrit on gerrit1001 when it was applied [10:22:23] gerrit has an embedded ssh daemon on port 29418 so should not be affected [10:22:36] else you can bypass CI entirely [10:22:38] hashar: what's messed up is the ssh client [10:22:39] i think it's more related to how it talks to ci [10:22:44] not the server [10:22:44] by removing the jenkins-bot verified -1 vote [10:22:54] which removes the veto and should let you submit the patch [10:22:58] (03CR) 10Jbond: [V: 03+2] C:ssh:client: Fix wrong Hosts config option [puppet] - 10https://gerrit.wikimedia.org/r/791569 (https://phabricator.wikimedia.org/T307565) (owner: 10Vgutierrez) [10:23:07] alternative is to do the hack on the puppet master git repository directly [10:23:10] run puppet [10:23:15] ack that worked sorry didn't think of that [10:23:22] recheck the CI change which should make jenkins-bot vote verified +1 [10:23:36] or in short there are two ways to bypass CI: [10:23:41] 1) remove the jenkins-bot -1 vote [10:23:55] 2) bypass Gerrit entirely and do a git revert in the local repo [10:24:27] the merge failure reported on the gerrit changes comes from zuul. 
Not sure why it failed [10:24:58] stderr: 'fatal: ssh variant 'simple' does not support setting port' [10:24:58] :) [10:25:36] I am guessing that is because CI fetches patches from Gerrit using the cli ssh [10:26:02] (03PS43) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [10:26:12] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:26:34] hashar: thanks i have managed to deploy the fix now [10:26:42] \o/ [10:26:57] vgutierrez: are there any machines you need me to prioritise fixing (looking at cumin/bastion already) [10:27:17] I triggered a puppet run on acmechief1001 [10:27:26] cause that's what pinged on icinga [10:27:51] May 13 10:27:36 acmechief1001 systemd[1]: acme-chief-certs-sync.service: Succeeded. [10:27:53] lovely :) [10:28:01] ack thanks [10:28:20] no problem [10:29:22] (03PS44) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [10:29:25] hmmm sshd provides -T to check the config [10:29:31] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:29:37] ssh client doesn't provide an analogous feature? 
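On the client-side question: the OpenSSH client has no `-T`, but `ssh -G <host>` parses the client configuration, prints the effective options and exits without connecting, returning non-zero when the file contains an option it cannot parse. A minimal sketch of a check built on that, assuming OpenSSH 6.8 or later is on PATH; the helper names and the `localhost` probe host are illustrative and not taken from the puppet repository:

```python
import subprocess


def build_check_command(config_path, host="localhost"):
    """Return the argv that makes ssh parse config_path without connecting."""
    # -G prints the effective configuration for `host` and exits;
    # -F points ssh at the file under test instead of the default config.
    return ["ssh", "-G", "-F", config_path, host]


def ssh_config_is_valid(config_path, host="localhost"):
    """Return True if ssh parses config_path cleanly (needs ssh on PATH)."""
    result = subprocess.run(
        build_check_command(config_path, host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.returncode == 0
```

In Puppet the same command could plausibly be wired into the `validate_cmd` parameter of the `file` resource that writes the client config, so a broken file is rejected before it replaces the working one.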
[10:29:53] we could use that on puppet to validate the syntax of the generated config [10:30:07] vgutierrez: ack good idea ill add that [10:31:09] (03CR) 10Jbond: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:32:54] (03PS2) 10Jbond: O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) [10:33:06] (03PS3) 10Jbond: O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) [10:33:32] (03PS2) 10Jbond: netbox-next: move netbox-next to caching infrastructure [dns] - 10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) [10:33:47] (03PS3) 10Jbond: netbox-next: move netbox-next to caching infrastructure [dns] - 10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) [10:34:30] (03CR) 10jerkins-bot: [V: 04-1] O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:35:12] (03PS4) 10Jbond: O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) [10:35:23] RECOVERY - Check systemd state on acmechief1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:31] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - 
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:39:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [10:41:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual reimage [10:41:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti4002.ulsfo.wmnet with reason: Remove from cluster for eventual reimage [10:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:38] (03CR) 10Ayounsi: [C: 03+1] netbox-next: move netbox-next to caching infrastructure [dns] - 10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:42:01] (03PS8) 10Jbond: P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 [10:44:59] (03CR) 10Jbond: [C: 03+2] P:cache::varnish::frontend: Update lookup keys [puppet] - 10https://gerrit.wikimedia.org/r/768762 (owner: 10Jbond) [10:46:45] (03PS3) 10Btullis: Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) [10:47:40] (03CR) 10Btullis: [C: 03+2] Fix the sudoers entry for airflow service checks [puppet] - 10https://gerrit.wikimedia.org/r/791570 (https://phabricator.wikimedia.org/T307102) (owner: 10Btullis) [10:51:51] (03CR) 10Hnowlan: New service: image-suggestion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [10:52:04] (03PS8) 10Hnowlan: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 
(https://phabricator.wikimedia.org/T304891) [10:52:05] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) ganeti4002 is from the same batch. I've migrated instances, removed it from the cluster for the reimage and downtimed it. @RobH Can you please update it to the same firmware a... [10:52:25] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03RobH [10:52:44] (03PS1) 10Jcrespo: prometheus: Avoid warnings on the mysqld exporter config generator [puppet] - 10https://gerrit.wikimedia.org/r/791580 [10:54:55] (03PS1) 10Giuseppe Lavagetto: docker-registry: add build2001 to the authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/791582 [10:55:37] !log installing idp-test2002 T308214 [10:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:42] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 [10:56:32] (03PS1) 10Alexandros Kosiaris: WIP: Add mc2038-mc2055 [puppet] - 10https://gerrit.wikimedia.org/r/791583 (https://phabricator.wikimedia.org/T293012) [10:56:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/791582 (owner: 10Giuseppe Lavagetto) [10:57:43] PROBLEM - SSH on ms-be1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:58:53] RECOVERY - SSH on ms-be1061 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:00:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker-registry: add build2001 to the authorized hosts [puppet] - 10https://gerrit.wikimedia.org/r/791582 (owner: 10Giuseppe Lavagetto) [11:00:36] (03PS1) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 
10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) [11:01:21] (03PS1) 10Jbond: netbox: create discovery record for netbox [dns] - 10https://gerrit.wikimedia.org/r/791586 (https://phabricator.wikimedia.org/T296452) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:01:57] (03PS1) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) [11:02:37] (03PS2) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) [11:03:10] (03CR) 10jerkins-bot: [V: 04-1] Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [11:03:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] Double the number of eventgate_analytics_external replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:03:31] (03CR) 10jerkins-bot: [V: 04-1] Double the number of eventgate_analytics_external replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:05:26] (03CR) 10Ayounsi: [C: 03+1] "To be tested but the logic lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [11:08:02] (03PS1) 10Jbond: O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) [11:09:43] RECOVERY - Checks that the airflow database for airflow research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:09:49] 10SRE-tools, 10Infrastructure-Foundations, 10serviceops: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) 05Open→03Resolved [11:10:55] RECOVERY - Checks that the local airflow scheduler for airflow @research is working properly on an-airflow1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-research /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:11:58] (03PS5) 10Jbond: O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) [11:12:00] (03PS3) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) [11:12:02] (03PS2) 10Jbond: O:cache::text: Move netbox to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791589 (https://phabricator.wikimedia.org/T296452) [11:14:17] (03CR) 10Jbond: rake_modules: add check for spdk licence header (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [11:14:25] (03CR) 10Jbond: [C: 03+2] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 
(https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [11:14:34] (03PS22) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [11:14:44] (03CR) 10Jbond: [C: 03+2] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [11:18:43] (03CR) 10Vgutierrez: [C: 03+1] "TLS material looks good on both SNI and non-SNI connections:" [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:20:19] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35253/console" [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:21:58] (03PS2) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) [11:24:23] (03CR) 10jerkins-bot: [V: 04-1] Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [11:25:52] (03PS23) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [11:25:54] (03PS7) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [11:27:23] (03CR) 10Jbond: [C: 03+2] O:cache::text: move netbox-next to the cacheing infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791571 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:28:27] (03CR) 10Jbond: [C: 03+2] apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [11:28:31] 
(03CR) 10Jbond: [C: 03+2] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [11:36:23] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7921343, @cmooney wrote: >> Even in the legacy setup (pre row e/f) adding new nodes requires manual er... [11:37:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:38:02] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:39:46] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:40:58] !log installing idp-test1002 T308214 [11:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] T308214: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 [11:41:50] (03Merged) 10jenkins-bot: Double the number of eventgate_analytics_external replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791320 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:42:28] (03PS3) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) [11:44:59] (03CR) 10jerkins-bot: [V: 04-1] Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [11:47:07] (03CR) 10Jbond: [C: 03+2] netbox-next: move netbox-next to caching infrastructure [dns] - 
10https://gerrit.wikimedia.org/r/791572 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:47:10] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:17] (03PS4) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) [11:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:57:19] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:15] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:45] weird, got an error [11:59:49] retrying [12:03:56] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:07:10] (03PS1) 10Alexandros Kosiaris: eventgate_analytics_external: Lower to 35 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791594 (https://phabricator.wikimedia.org/T306181) [12:08:33] (03CR) 10Btullis: [C: 03+1] eventgate_analytics_external: Lower to 35 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/791594 (https://phabricator.wikimedia.org/T306181) (owner: 10Alexandros Kosiaris) [12:09:24] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [12:09:28] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/791594 (https://phabricator.wikimedia.org/T306181) (owner: 10Alexandros Kosiaris) [12:12:09] (03CR) 10Marostegui: "What is the expected behaviour if you run the schema change with --include-masters? Will it try to alter also the primary master for the a" [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:15:21] (03CR) 10Filippo Giunchedi: [C: 03+2] "Ack, thanks! Proceeding" [puppet] - 10https://gerrit.wikimedia.org/r/791309 (owner: 10Filippo Giunchedi) [12:16:10] (03PS1) 10Cathal Mooney: Add DHCP config files for new cloud host nets and rename older files [puppet] - 10https://gerrit.wikimedia.org/r/791595 (https://phabricator.wikimedia.org/T304989) [12:16:56] (03CR) 10jerkins-bot: [V: 04-1] Add DHCP config files for new cloud host nets and rename older files [puppet] - 10https://gerrit.wikimedia.org/r/791595 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:18:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2140 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P27824 and previous config saved to /var/cache/conftool/dbconfig/20220513-121832-marostegui.json [12:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:43] (03PS1) 10Marostegui: Revert "db2140: Broken host" [puppet] - 10https://gerrit.wikimedia.org/r/791494 [12:19:13] 10SRE, 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 (10Marostegui) 05Open→03Resolved a:05Marostegui→03Papaul Thanks Papaul! 
[12:19:35] (03CR) 10Marostegui: [C: 03+2] Revert "db2140: Broken host" [puppet] - 10https://gerrit.wikimedia.org/r/791494 (owner: 10Marostegui) [12:19:35] 10SRE, 10ops-codfw, 10DBA: db2140 broken storage - https://phabricator.wikimedia.org/T308202 (10Marostegui) db2140 has caught up and seems to be replicating just fine. Closing this for now and if it crashes again we'll reopen [12:21:42] (03CR) 10Ayounsi: "+1 on the data.yaml change. Can't vouch for the OSD side." [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:23:37] (03CR) 10Filippo Giunchedi: WIP move network routers definitions to hiera (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:24:43] (03CR) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:25:31] (03PS1) 10Muehlenhoff: Add SPDX headers to debdeploy/adduser/puppetboard modules [puppet] - 10https://gerrit.wikimedia.org/r/791596 (https://phabricator.wikimedia.org/T308013) [12:26:22] (03PS3) 10Filippo Giunchedi: netops: add site/role to netops::check to cater for new data structure [puppet] - 10https://gerrit.wikimedia.org/r/791310 [12:26:24] (03PS4) 10Filippo Giunchedi: WIP move network routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) [12:26:26] (03PS6) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [12:27:38] (03CR) 10jerkins-bot: [V: 04-1] WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:29:41] (03CR) 10jerkins-bot: [V: 04-1] WIP move network routers 
definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [12:30:15] (03PS1) 10Alexandros Kosiaris: WIP: kubernetes: Use netbox data to populate topology labels [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) [12:31:43] (03CR) 10Marostegui: "I think we probably a --primary-master is better, otherwise, ignore by default any primary DC master (when replicas set to None)" [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [12:31:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "dnm. Still has a number of issues:" [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [12:33:11] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 04-1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35254/console" [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [12:33:28] (03CR) 10Filippo Giunchedi: "A nit inline, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/791580 (owner: 10Jcrespo) [12:37:27] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [12:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:55] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [12:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:16] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [12:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:33] (03PS5) 10Filippo Giunchedi: WIP move network routers definitions to hiera [puppet] - 10https://gerrit.wikimedia.org/r/777347 
(https://phabricator.wikimedia.org/T169860) [12:38:35] (03PS7) 10Filippo Giunchedi: WIP test replacing smokeping with blackbox exporter [puppet] - 10https://gerrit.wikimedia.org/r/777330 (https://phabricator.wikimedia.org/T169860) [12:38:38] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [12:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:33] (03PS1) 10Jbond: C:cfssl: Add spdx licences [puppet] - 10https://gerrit.wikimedia.org/r/791598 [12:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [12:44:06] (03CR) 10JMeybohm: New service: image-suggestion (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:44:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35255/console" [puppet] - 10https://gerrit.wikimedia.org/r/791310 (owner: 10Filippo Giunchedi) [12:45:10] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] netops: add site/role to netops::check to cater for new data structure [puppet] - 10https://gerrit.wikimedia.org/r/791310 (owner: 10Filippo Giunchedi) [12:45:45] (03PS2) 10Cathal Mooney: Add DHCP config files for new cloud host nets and rename older files [puppet] - 10https://gerrit.wikimedia.org/r/791595 (https://phabricator.wikimedia.org/T304989) [12:46:50] (03CR) 10JMeybohm: [C: 04-1] Add helmfile configuration for image-suggestion (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:50:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 
10https://gerrit.wikimedia.org/r/791598 (owner: 10Jbond) [12:51:20] (03PS1) 10Jelto: gitlab: use gitlab1003 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/791599 (https://phabricator.wikimedia.org/T307142) [12:52:38] (03CR) 10Filippo Giunchedi: prometheus::blackbox::check: add new blackbox exporter check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [12:53:45] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35256/console" [puppet] - 10https://gerrit.wikimedia.org/r/791599 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [12:56:17] (03PS2) 10Alexandros Kosiaris: WIP: kubernetes: Use netbox data to populate topology labels [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) [12:58:04] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35257/console" [puppet] - 10https://gerrit.wikimedia.org/r/791597 (https://phabricator.wikimedia.org/T270191) (owner: 10Alexandros Kosiaris) [13:00:34] (03PS1) 10Cathal Mooney: Add new cloudsw to rancid for config backup [puppet] - 10https://gerrit.wikimedia.org/r/791600 (https://phabricator.wikimedia.org/T304989) [13:00:57] (03PS4) 10Jbond: services: Add DNS discovery record for netbox [puppet] - 10https://gerrit.wikimedia.org/r/791588 (https://phabricator.wikimedia.org/T296452) [13:02:53] (03CR) 10Rosalie Perside (WMDE): [C: 03+1] puppet_alert: Improve message [puppet] - 10https://gerrit.wikimedia.org/r/791559 (owner: 10Lucas Werkmeister (WMDE)) [13:03:20] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Ladsgroup) [13:04:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:05:27] (03PS1) 10Jbond: DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 [13:05:29] (03CR) 10Ayounsi: [C: 03+1] Add new cloudsw to rancid for config backup [puppet] - 10https://gerrit.wikimedia.org/r/791600 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:08:12] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 (owner: 10Jbond) [13:09:47] (03PS2) 10Jbond: DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 [13:10:44] (03CR) 10Ayounsi: [C: 03+1] "+1 if PCC is happy" [puppet] - 10https://gerrit.wikimedia.org/r/791595 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [13:12:29] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 (owner: 10Jbond) [13:12:53] (03PS3) 10Jbond: DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 [13:12:55] (03PS1) 10Jbond: rake: log later [puppet] - 10https://gerrit.wikimedia.org/r/791604 [13:13:07] (03CR) 10Jbond: [V: 03+2 C: 03+2] rake: log later [puppet] - 10https://gerrit.wikimedia.org/r/791604 (owner: 10Jbond) [13:14:57] (03PS1) 10TheDJ: Remove unused OggThumbLocation config variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791605 (https://phabricator.wikimedia.org/T308191) [13:16:00] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: demo rake task [puppet] - 10https://gerrit.wikimedia.org/r/791601 (owner: 10Jbond) [13:31:36] (03Abandoned) 10Jbond: rake: test spdx::check:new_files CI check [puppet] - 10https://gerrit.wikimedia.org/r/790749 (owner: 10Jbond) [13:35:35] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35258/console" [puppet] - 10https://gerrit.wikimedia.org/r/753029 
(https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [13:36:52] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) >>! In T306649#7926930, @akosiaris wrote: >> >> If the idea is ok, I'd propose to use labels like `wikimedia.org/node-l... [13:43:14] (03CR) 10Jbond: [C: 03+2] C:cfssl: Add spdx licences [puppet] - 10https://gerrit.wikimedia.org/r/791598 (owner: 10Jbond) [13:43:31] 10SRE, 10serviceops: Service Ops SRE support for iOS notifications update - https://phabricator.wikimedia.org/T306397 (10Tsevener) @akosiaris cool, thanks! My instinct is that it feels a bit low - I wonder if pushes are getting dropped somewhere. It would be cool if we could somehow check how many Echo notific... [13:44:42] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: update docstrings to use YARD-style tags [puppet] - 10https://gerrit.wikimedia.org/r/791429 (owner: 10Ssingh) [13:45:02] jbond: ok to merge your change? 
:) [13:45:10] Jbond: C:cfssl: Add spdx licences (68ad467ee1) [13:48:45] (03PS1) 10Jelto: wikimedia.org: add gitlab-replica-new records [dns] - 10https://gerrit.wikimedia.org/r/791608 (https://phabricator.wikimedia.org/T307142) [13:50:48] (03PS1) 10Elukey: deployment-prep: update hostname of deployment-puppetdb03 [puppet] - 10https://gerrit.wikimedia.org/r/791609 (https://phabricator.wikimedia.org/T307762) [13:53:06] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35259/console" [puppet] - 10https://gerrit.wikimedia.org/r/791609 (https://phabricator.wikimedia.org/T307762) (owner: 10Elukey) [13:53:17] (03Abandoned) 10Elukey: deployment-prep: update hostname of deployment-puppetdb03 [puppet] - 10https://gerrit.wikimedia.org/r/791609 (https://phabricator.wikimedia.org/T307762) (owner: 10Elukey) [13:58:41] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:00:31] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:03:09] stepping away for ~10 mins, please feel free to merge my change on puppetmaster [14:04:15] (03CR) 10Giuseppe Lavagetto: requestctl: add validate command (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) (owner: 10Giuseppe Lavagetto) [14:06:41] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:06:55] (03PS2) 10Giuseppe Lavagetto: requestctl: add validate command [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) [14:06:57] (03PS2) 10Giuseppe Lavagetto: requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 [14:06:59] (03PS2) 10Giuseppe Lavagetto: Raise an error if wrong tags are used in a query. [software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) [14:07:01] (03PS2) 10Giuseppe Lavagetto: New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 [14:09:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: add validate command [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) (owner: 10Giuseppe Lavagetto) [14:11:38] (03Merged) 10jenkins-bot: requestctl: add validate command [software/conftool] - 10https://gerrit.wikimedia.org/r/791363 (https://phabricator.wikimedia.org/T307905) (owner: 10Giuseppe Lavagetto) [14:16:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 (owner: 10Giuseppe Lavagetto) [14:18:54] (03Merged) 10jenkins-bot: requestctl: update readme with all pending changes [software/conftool] - 10https://gerrit.wikimedia.org/r/791364 (owner: 10Giuseppe Lavagetto) [14:25:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:25:56] back [14:26:18] (still pending change from jbond but please feel free to merge mine) [14:26:52] (03PS1) 10Ottomata: Install sasl packages needed to authenticate with kerberos on airflow nodes [puppet] - 10https://gerrit.wikimedia.org/r/791612 [14:27:33] (03PS1) 10Filippo Giunchedi: prometheus: refactor 
prometheus-node-exim-queue [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) [14:27:37] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.280 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:31:28] (03CR) 10Ottomata: [C: 03+2] Install sasl packages needed to authenticate with kerberos on airflow nodes [puppet] - 10https://gerrit.wikimedia.org/r/791612 (owner: 10Ottomata) [14:32:46] jbond: cfssl puppet changes okay to merge? [14:33:27] looks like just comments, i'm going to merge [14:34:13] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:34:24] ottomata: thanks! [14:34:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Raise an error if wrong tags are used in a query. [software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) (owner: 10Giuseppe Lavagetto) [14:35:45] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:36:34] (03Merged) 10jenkins-bot: Raise an error if wrong tags are used in a query. 
[software/conftool] - 10https://gerrit.wikimedia.org/r/790980 (https://phabricator.wikimedia.org/T308100) (owner: 10Giuseppe Lavagetto) [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:41:03] (03PS1) 10Filippo Giunchedi: Export exim queue length from mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/791615 (https://phabricator.wikimedia.org/T305847) [14:41:07] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:43:03] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:32] (03PS2) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc [software] - 10https://gerrit.wikimedia.org/r/791297 [14:53:47] (03CR) 10Ladsgroup: auto_schema: Make alter non-blocking on master of primary dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:11:15] 10ops-eqiad: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10ayounsi) [15:12:52] 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) [15:13:21] 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) Please let me know 
when you'd like to get the database depooled and powered off. [15:13:49] (03CR) 10Ahmon Dancy: [V: 03+1 C: 03+1] "Worked for me." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [15:20:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 (owner: 10Giuseppe Lavagetto) [15:20:58] 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10MoritzMuehlenhoff) No problem, with a few days of advance warning to drain the node we can easily move ganeti1020 any time. [15:24:21] (03Merged) 10jenkins-bot: New version 2.2.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/791365 (owner: 10Giuseppe Lavagetto) [15:31:25] ottomata: sorry i completly missed this, thanks <3 [15:32:21] (03PS2) 10Jcrespo: prometheus: Avoid warnings on the mysqld exporter config generator [puppet] - 10https://gerrit.wikimedia.org/r/791580 [15:32:33] (03CR) 10Jcrespo: prometheus: Avoid warnings on the mysqld exporter config generator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791580 (owner: 10Jcrespo) [15:34:03] (03PS3) 10Jcrespo: prometheus: Avoid warnings on the mysqld exporter config generator [puppet] - 10https://gerrit.wikimedia.org/r/791580 [15:37:27] (03CR) 10Filippo Giunchedi: "This is deployed on toolforge mail relay, hence engaging WMCS folks (please add/remove reviewers as you see fit!)" [puppet] - 10https://gerrit.wikimedia.org/r/791613 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:38:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/791580 (owner: 10Jcrespo) [15:40:58] (03CR) 10Jcrespo: [C: 03+2] prometheus: Avoid warnings on the mysqld exporter config generator [puppet] - 10https://gerrit.wikimedia.org/r/791580 (owner: 10Jcrespo) [15:44:11] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:45] 10ops-eqiad: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10ayounsi) [15:46:30] 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) Please let me know when you'd like to get the databases and dbproxy depooled and powered off. [15:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:56:46] (03CR) 10Ahmon Dancy: [C: 04-1] "Just some minor text nits." [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [15:57:06] <_joe_> !log uploading conftool 2.2.0 to buster, bullseye T305824 T305582 T305607 T305638 T307905 T308100 [15:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:19] T305607: Support NOT in the dsl grammar - https://phabricator.wikimedia.org/T305607 [15:57:19] T305638: requestctl should have a "find actions using pattern foo" feature - https://phabricator.wikimedia.org/T305638 [15:57:19] T308100: Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 [15:57:20] T305582: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 [15:57:20] T307905: Add validate function to conftool reqconfig extension - https://phabricator.wikimedia.org/T307905 [15:57:20] T305824: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 [15:59:27] (03CR) 10Ahmon Dancy: [C: 04-1] C:helm: make the group permissions on helm_cache configurable (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [16:11:13] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7927197, @elukey wrote: > Ack I think that it makes sense. Just to understand the mapping, at the mom... [16:18:32] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) The 50% bump in capacity didn't make any noticeable difference this time around. :-( [16:27:22] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) >>! In T306181#7927769, @akosiaris wrote: > The 50% bump in capacity didn't make any... [16:31:45] (03CR) 10Marostegui: "but if you go for None as replicas and: --include-masters would that still try to go for the primary DC?" [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [16:32:47] 10SRE, 10SRE-swift-storage, 10ops-eqiad: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) They asked for a screenshot of the network tab in ILO. I also attached here {F35138640} [16:36:07] 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) Thanks! Working on next FY budget, I noticed that FPC3 (MPC4E-3D-32XGE-SFPP, 2016) still have a few years before being due for a refresh. We do have to migrate links away from FPC4 though (2012).... 
[16:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:56:53] (03CR) 10MewOphaswongse: [C: 03+1] Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [16:57:32] (03CR) 10Dzahn: [C: 04-1] "I think there are missing "v"s in the URLs it tries to curl from github" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [16:58:12] (03PS1) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) [16:59:00] (03CR) 10Dzahn: "no, my bad when testing, nevermind" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [16:59:15] (03CR) 10Dzahn: [C: 03+2] buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:02:13] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:28] (03CR) 10Dzahn: [V: 03+2 C: 03+2] buildkitd: Provide buildkitd image for trusted GitLab runners [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/791427 (https://phabricator.wikimedia.org/T308271) (owner: 10Dduvall) [17:18:27] (03CR) 10Hashar: "recheck" [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 
10Hashar) [17:19:02] (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [17:21:17] (03CR) 10Hashar: "I have no clue why CI triggers SonarQube :D" [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [17:22:38] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn I heard back from Advancement and they're ready to move on this. They have a ma... [17:24:16] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [17:24:17] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudservices1004.wikimedia.org [17:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [17:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1004.wikimedia.org [17:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:18] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1003.wikimedia.org [17:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:01] PROBLEM - Check systemd state on an-worker1125 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:28] (03PS1) 10Hashar: Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 [17:34:09] (03CR) 10jerkins-bot: [V: 04-1] Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 (owner: 10Hashar) [17:35:15] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team: Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (10RLazarus) 05Open→03Stalled a:03elukey >>! In T308308#7926592, @elukey wrote: > I may have created this task too soon, some discussion on T305729 is stil... [17:37:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1003.wikimedia.org [17:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:27] 10SRE-Access-Requests, 10GitLab (CI & Job Runners), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10brennen) [17:44:49] (03PS2) 10Hashar: Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 [17:45:22] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team: Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (10RLazarus) p:05Triage→03Medium [17:45:30] (03CR) 10jerkins-bot: [V: 04-1] Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 (owner: 10Hashar) [17:46:43] RECOVERY - Check systemd state on an-worker1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:47:01] (03PS3) 10Hashar: Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 [17:47:35] 
(03CR) 10jerkins-bot: [V: 04-1] Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 (owner: 10Hashar) [18:04:41] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:17:42] 10SRE-Access-Requests, 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10brennen) [18:21:01] (03CR) 10Gergő Tisza: [C: 03+1] "This patch is blocking Iea95a7e56415657dddef4a25f6159527d18aeb03 from merging. What's the deployment plan for that? If it's to deploy this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (https://phabricator.wikimedia.org/T305443) (owner: 10Sergio Gimeno) [18:23:03] 10SRE-Access-Requests, 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10thcipriani) [18:23:44] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) Hi @bcampbell Thanks for the update! sounds good to me. Do they have a specific time set for t... [18:24:26] 10SRE-Access-Requests, 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10thcipriani) Sounds good from from my side: seems... 
[18:27:47] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "matches https://netbox.wikimedia.org/ipam/ip-addresses/10940/ & https://netbox.wikimedia.org/ipam/ip-addresses/10941/" [dns] - 10https://gerrit.wikimedia.org/r/791608 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [18:28:47] 10SRE-Access-Requests, 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10RLazarus) We're past the European work day, so I... [18:28:59] 10SRE-Access-Requests, 10serviceops, 10GitLab (CI & Job Runners), 10Release-Engineering-Team (GitLab-a-thon 🦊), 10User-brennen: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10RLazarus) p:05Triage→03Medium [18:35:20] (03PS1) 10Dduvall: WIP: Optionally provide buildkitd to GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/791655 (https://phabricator.wikimedia.org/T308271) [18:35:45] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Dzahn) regarding `scandium`: That just needs a heads up to @ssastry when the move is planned to happen. nothing much from my side here (if it just comes back as before). thanks! [18:36:11] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 3 others: Access to trusted gitlab runners for gitlab-roots (or appropriate similar group) - https://phabricator.wikimedia.org/T308350 (10RLazarus) Hmm, also: As a group access change, this should be reviewed and approved in the... 
[18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:46:01] RECOVERY - SSH on wtp1037.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:48:30] (03CR) 10Krinkle: [C: 04-1] "Commit message is identical to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/791302 but does something different. Wrong c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791303 (https://phabricator.wikimedia.org/T308209) (owner: 10Kosta Harlan) [18:48:34] (03CR) 10Krinkle: [C: 03+1] GrowthExperiments: Remove unused GEHomepageSuggestedEditsRequiresOptIn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791302 (https://phabricator.wikimedia.org/T308208) (owner: 10Kosta Harlan) [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:04:24] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Dzahn) Oh wait, does moving racks and running that cookbook mean IP addresses will change? [19:05:29] mutante: pretty sure that's a no [19:05:50] 'As the hosts stay in the same vlan' being why [19:05:51] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:06:17] RhinosF1: oh, did it say that? I missed it. 
thanks [19:06:28] I see [19:06:36] mutante: in the description before the steps [19:07:02] yep, ty [19:12:57] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:05] (03CR) 10Dzahn: [C: 03+1] gitlab: use gitlab1003 as replia/passive host [puppet] - 10https://gerrit.wikimedia.org/r/791599 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [19:18:03] (03CR) 10Dzahn: [C: 03+2] git: add define for abritrarily named config file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791327 (https://phabricator.wikimedia.org/T307620) (owner: 10Hashar) [19:19:09] (03PS1) 10Eevans: WIP: enable cassandra encryption (aqs cluster) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) [19:19:13] (03PS2) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [19:21:13] (03CR) 10Eevans: [C: 04-1] "This cannot be merged until the keys & certs have been created." [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [19:21:49] (03CR) 10jerkins-bot: [V: 04-1] etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:29:32] (03CR) 10Eevans: [C: 04-1] "Fails (as expected) for the AQS cluster (no keys & certs). Ok for RESTBase & Sessionstore clusters." 
[puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [19:33:33] (03PS1) 10Andrew Bogott: nova_fullstack_test: tag each log message with the affected VM [puppet] - 10https://gerrit.wikimedia.org/r/791665 [19:34:36] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10AlexisJazz) >>! In T306181#7927825, @BTullis wrote: >>>! In T306181#7927769, @akosiaris wrote... [19:35:35] (03PS2) 10Andrew Bogott: nova_fullstack_test: tag each log message with the affected VM [puppet] - 10https://gerrit.wikimedia.org/r/791665 [19:35:58] (03PS3) 10Andrew Bogott: nova_fullstack_test: tag each log message with the affected VM [puppet] - 10https://gerrit.wikimedia.org/r/791665 [19:37:57] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:04] (03CR) 10Andrew Bogott: [C: 03+2] nova_fullstack_test: tag each log message with the affected VM [puppet] - 10https://gerrit.wikimedia.org/r/791665 (owner: 10Andrew Bogott) [19:38:58] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/pcc-worker1002/35260/conf1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:39:30] (03CR) 10Dzahn: "but maybe this is moot because it should be replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/790657 ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:39:46] (03PS1) 10Bking: elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) [19:41:06] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [19:44:49] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:00] (03CR) 10Dzahn: "# [*use_cergen*]" [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:48:37] (03PS1) 10Eevans: Fake keys and certificates for cassandra (aqs) [labs/private] - 10https://gerrit.wikimedia.org/r/791667 (https://phabricator.wikimedia.org/T307798) [19:49:15] (03PS3) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [19:49:58] (03PS4) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [19:51:45] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) That makes sense @Dzahn. I'll check in again on this task on the 16th. Advancement told m... 
[19:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:52:34] (03CR) 10jerkins-bot: [V: 04-1] etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:53:50] (03CR) 10Hnowlan: New service: image-suggestion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [19:54:21] (03PS5) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [19:56:49] (03CR) 10jerkins-bot: [V: 04-1] etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [19:56:54] (03PS2) 10Hnowlan: Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) [19:57:08] (03CR) 10Hnowlan: Add helmfile configuration for image-suggestion (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [19:59:05] (03CR) 10jerkins-bot: [V: 04-1] Add helmfile configuration for image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/791324 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [20:01:12] (03PS1) 10Andrew Bogott: nova_fullstack_test: abuse the cloud.instance.name field to hold the test VM [puppet] - 10https://gerrit.wikimedia.org/r/791668 [20:01:58] (03PS3) 10Dzahn: P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) 
[20:02:20] (03CR) 10jerkins-bot: [V: 04-1] P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [20:03:46] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/35262/conf1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [20:03:59] (03PS4) 10Dzahn: P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [20:04:31] (03CR) 10Dzahn: "can I still do https://gerrit.wikimedia.org/r/c/operations/puppet/+/788437 first and then we talk about this switch next?" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [20:05:27] 10SRE, 10serviceops: Renew puppet cert for etcd.codfw.wmnet - https://phabricator.wikimedia.org/T302153 (10Dzahn) [20:05:31] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Dzahn) 05Open→03In progress [20:07:31] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:21] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:54] 10SRE-swift-storage, 10Arc-Lamp, 10Performance-Team, 10Patch-For-Review: Swift container for performance flame graphs (ArcLamp) - https://phabricator.wikimedia.org/T244776 (10Krinkle) >>! In T244776#6475486, @gerritbot wrote: > Change 626241 **merged** by jenkins-bot: > [performance/arc-lamp@master] Genera... 
[20:17:50] (03PS6) 10Dzahn: etcd::tlsproxy: set use_cergen to true [puppet] - 10https://gerrit.wikimedia.org/r/788437 (https://phabricator.wikimedia.org/T307382) [20:21:49] (03PS1) 10Dzahn: delete expired certs etcd.eqiad.wmnet.crt and etcd.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) [20:22:39] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [20:24:46] (03CR) 10Dzahn: "let me rebase that so it's actually _not_ on top of https://gerrit.wikimedia.org/r/c/operations/puppet/+/788437" [puppet] - 10https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [20:24:51] (03PS2) 10Dzahn: delete expired certs etcd.eqiad.wmnet.crt and etcd.codfw.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) [20:24:57] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) As far as I know IPs won't change [20:26:21] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/35264/conf1005.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/791671 (https://phabricator.wikimedia.org/T307382) (owner: 10Dzahn) [20:33:57] (03CR) 10BryanDavis: "btullis: I'm not a root, so this is all yours to merge and deploy." [puppet] - 10https://gerrit.wikimedia.org/r/786382 (owner: 10BryanDavis) [20:37:08] bd808: I think b.tullis is off for the day, do you need to get that in sooner than Monday? [20:37:09] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:37:29] number_of_sres_tricked_into_believing_bd808_is_root++; [20:37:40] rzl: no, but thanks for checking. 
:) [20:37:42] (03PS1) 10Dzahn: delete expired globalsign-2018/2019 certs. [puppet] - 10https://gerrit.wikimedia.org/r/791673 [20:37:43] 👍 [20:39:10] the license plate on my jeep is SUDO, but that still doesn't work in ops/puppet.git ;) [20:39:58] well that's unavoidable, you can't run it without a driver [20:41:08] (03PS1) 10Dzahn: delete expired ldap-labs certificates [puppet] - 10https://gerrit.wikimedia.org/r/791674 [20:42:04] (03PS2) 10Bking: elastic: remove decommissioned hosts in beta [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) [20:42:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/791666 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [20:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:43:59] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:48:35] (03PS1) 10Dzahn: delete expired ldap-corp certificates [puppet] - 10https://gerrit.wikimedia.org/r/791677 [20:49:41] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Gehel) [20:53:10] (03PS1) 10Dzahn: delete expired digicert certs [puppet] - 10https://gerrit.wikimedia.org/r/791678 [20:53:55] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [20:56:25] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: 
CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [21:07:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [21:12:00] 10SRE, 10ops-eqiad, 10DBA: db1164 power supply isn't redundant - https://phabricator.wikimedia.org/T308246 (10wiki_willy) a:03Jclark-ctr Hi @Jclark-ctr - can you check this out, since @Cmjohnson will be on vacation? Thanks, Willy [21:13:43] 10SRE, 10ops-eqsin: cr3-eqsin:xe-0/1/1 interface errors - https://phabricator.wikimedia.org/T300485 (10wiki_willy) a:03RobH [21:15:55] (03PS1) 10Dzahn: doc: add monitoring of doc.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/791684 [21:16:39] (03CR) 10jerkins-bot: [V: 04-1] doc: add monitoring of doc.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/791684 (owner: 10Dzahn) [21:22:09] (03PS2) 10Dzahn: doc: add monitoring of doc.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/791684 [21:32:00] ACKNOWLEDGEMENT - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service Ryan Kemper search dev experimenting on host failures expected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:36:15] testing new highlight for wcqs [21:38:31] RECOVERY - Check systemd state on wcqs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:20] wcqs [21:42:19] well, that was a dismal failure. 
ryankemper or anyone else, do you mind typing 'wcqs' into a message? Maybe it doesn't highlight if I type it myself [21:42:27] this line's got wcqs in it [21:42:36] this one says wcqs1001 [21:43:25] excellent rzl ! The first works, but not the second. Let me see if * works [21:43:39] (03PS4) 10Hashar: Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 [21:44:10] OK, try typing wcqs1001 again por favor? [21:44:13] (03CR) 10jerkins-bot: [V: 04-1] Switch from extension to plugin API [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791647 (owner: 10Hashar) [21:44:23] wcqs1001 again [21:45:29] Nope! I'll hit up my clients' tech support, they're usually pretty good [21:46:00] Maybe it takes regex syntax? Like .* [21:46:07] good luck! worst case you can just add 9,000 highlights for wcqs1000 through wcqs9999 and then you'll be all set [21:47:22] ...until we get wcqs10001 ;P [21:47:44] sounds fake [21:49:03] (03PS1) 10Hashar: Add SonarQube scanner [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791692 [21:49:53] PROBLEM - SSH on wtp1037.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:50:56] (03CR) 10jerkins-bot: [V: 04-1] Add SonarQube scanner [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791692 (owner: 10Hashar) [21:51:36] when you start using 5 digit host name numbers CI will V-1 you :) [21:51:44] (? in "typos" file in the root of the puppet repo. heh [21:55:23] LOL, always be ready [21:55:49] OK, someone send "wcqs1000" one more time if you don't mind... [21:56:01] wcqs1001 one more time [21:56:07] and wcqs1001.eqiad.wmnet for good measure [21:56:15] ding ding ding, it's working [21:56:17] \o/ [21:56:34] {◕ ◑ ◕} [21:56:39] Thanks again rzl ! [21:57:11] mutante: [^1].*eqiad is still there that is great [21:58:00] sure thing! 
[21:59:05] wCqS (test) [21:59:22] It works! [22:00:01] :) we can also change the notification command for icinga in the first place, btw [22:00:09] Confusingly, Textual (my IRC client) has general notifications (which support regex) and server-specific notifications (which don't) [22:00:24] like "if contact is wcqs team then .. run whatever command.. like send email or whatnot" [22:01:48] that works too! [22:07:11] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:54:30] (03CR) 10Brennen Bearnes: gitlab runner: restrict docker images and services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [23:00:13] RECOVERY - MegaRAID on an-worker1081 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:13:53] !log razzi@deploy1002 Started deploy [analytics/turnilo/deploy@bf60521]: Staging deployment of turnilo 1.35 [23:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:02] !log razzi@deploy1002 Finished deploy [analytics/turnilo/deploy@bf60521]: Staging deployment of turnilo 1.35 (duration: 00m 08s) [23:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[23:33:53] PROBLEM - MegaRAID on an-worker1081 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [23:40:42] (03CR) 10Jforrester: [C: 04-1] "Per Majavah" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [23:42:13] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-tool1007.eqiad.wmnet with reason: Upgrade turnilo [23:42:14] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-tool1007.eqiad.wmnet with reason: Upgrade turnilo [23:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:05] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The following units failed: wdqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:52:28] (03CR) 10Jforrester: "This should ship on or after 23 May." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788385 (https://phabricator.wikimedia.org/T248418) (owner: 10Jforrester) [23:53:17] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook