[00:00:04] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T0000).
[00:00:04] No Gerrit patches in the queue for this window AFAICS.
[00:00:29] PROBLEM - Memcached on thumbor1006 is CRITICAL: connect to address 10.64.32.149 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[00:02:32] legoktm ^ that uh... normal?
[00:02:46] no, I'm debugging it now
[00:02:51] but 1006 is not in service yet
[00:03:01] ah okay, cool
[00:03:45] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.88 ms
[00:06:57] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1006.eqiad.wmnet
[00:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:50] there's some race condition in start up I think, I had to restart haproxy for it to start working
[00:07:59] and I needed to reboot it anyways
[00:10:17] RECOVERY - Memcached on thumbor1006 is OK: TCP OK - 0.000 second response time on 10.64.32.149 port 11211 https://wikitech.wikimedia.org/wiki/Memcached
[00:10:39] fastest ping ever :D
[00:12:46] (CR) Ladsgroup: [C: +1] acme_chief: convert cron to restart service to timer [puppet] - https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[00:15:13] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1006.eqiad.wmnet
[00:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:18] !log T276198 `ryankemper@cumin1001:~$ sudo cumin -b 3 '*elastic*' 'sudo run-puppet-agent --force'` Change looks good (no complaints from systemd), rolling out to rest of fleet / reenabling puppet
[00:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:21] T276198: /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198
[00:24:46] (CR) Legoktm: mediawiki: Ensure mwdeploy user is a member of the www-data group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: Ahmon Dancy)
[00:27:51] (CR) Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: Ahmon Dancy)
[00:34:38] (PS1) Ryan Kemper: elasticsearch: disallow puppet to restart [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902)
[00:37:25] (PS2) Ryan Kemper: elasticsearch: disallow puppet to restart [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902)
[00:40:04] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: Ryan Kemper)
[00:40:53] (CR) Ryan Kemper: "I'm not aware of any case where we want puppet to be able to automatically restart the elasticsearch services, so this approach will rende" [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: Ryan Kemper)
[00:49:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:53:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:04:17] (CR) Ryan Kemper: elasticsearch: disallow puppet to restart (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: Ryan Kemper)
[01:40:56] (CR) jerkins-bot: [V: -1] gitlab-runner: restrict docker images and services [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[01:44:03] SRE, SRE-swift-storage, ops-codfw, DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (Papaul) @MatthewVernon thank you for getting all this info on the task I checked the server today i didn't see any failed disk. I will need mor...
[01:46:19] (CR) Brennen Bearnes: "Unclear why this is failing" [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[01:46:57] (CR) Brennen Bearnes: gitlab-runner: restrict docker images and services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[01:49:32] SRE, ops-codfw, DC-Ops, SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (Papaul)
[01:52:09] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul)
[02:04:52] SRE, Discovery-Search (Current work): /var/run/elasticsearch deleted by elasticsearch - https://phabricator.wikimedia.org/T276198 (RKemper) We can probably move this to `Needs Reporting` but I'll check with @dcausse and others in our Search team Wednesday meeting. For now I'll stick this in Waiting while...
[03:25:26] (PS1) KartikMistry: Enable more languages for Section Translation in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223)
[03:35:59] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=citoid.svc.eqiad.wmnet, port=4003): Read timed out. (read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Citoid
[03:37:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:44:09] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[03:50:19] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[04:13:21] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcephmon1001, stat1008, stat1005, cloudcephmon1003, cloudcephmon1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:50:35] (CR) Legoktm: python39: Use shell reimplementation of webservice-runner (3 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[05:50:38] (PS3) Legoktm: python39: Use shell reimplementation of webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552)
[05:54:39] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudcephmon1002, stat1005, cloudcephmon1001, cloudcephmon1003, stat1008 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:04:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17742 and previous config saved to /var/cache/conftool/dbconfig/20211117-060426-marostegui.json
[06:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:30] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:30:49] (PS1) Marostegui: dbproxy1018: Depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/739401
[06:31:59] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1018 [puppet] - https://gerrit.wikimedia.org/r/739401 (owner: Marostegui)
[06:33:27] !log Upgrade clouddb1018
[06:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:53] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1018" [puppet] - https://gerrit.wikimedia.org/r/739294
[06:35:39] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1018" [puppet] - https://gerrit.wikimedia.org/r/739294 (owner: Marostegui)
[06:38:14] !log start of deleting auto-review logs in arwiki (T285608) deleting 23M rows
[06:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:17] T285608: Stop logging and clean up auto review logs - https://phabricator.wikimedia.org/T285608
[06:43:29] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6008.drmrs.wmnet with OS buster
[06:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:39] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster
[06:52:49] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6009.drmrs.wmnet with OS buster
[06:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:59] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster
[06:57:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 for upgrade', diff saved to https://phabricator.wikimedia.org/P17743 and previous config saved to /var/cache/conftool/dbconfig/20211117-065740-marostegui.json
[06:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:41] !log Upgrade db1180 to 10.4.22
[06:58:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17744 and previous config saved to /var/cache/conftool/dbconfig/20211117-070055-root.json
[07:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:45] marostegui: btw, https://sal.toolforge.org/log/3hyeLH0B8Fs0LHO5CQY9 let me know if you see anything with s7, so far it's good
[07:07:39] (PS1) Majavah: kubeadm: raise default to 1.20 [puppet] - https://gerrit.wikimedia.org/r/739402
[07:07:41] (PS1) Majavah: aptrepo: drop k8s 1.19 repos [puppet] - https://gerrit.wikimedia.org/r/739403
[07:15:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17745 and previous config saved to /var/cache/conftool/dbconfig/20211117-071559-root.json
[07:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:11] Puppet, Infrastructure-Foundations, Wikidata, wdwb-tech, User-Ladsgroup: Migrate wikibase-dispatch-changes crons to systemd timers - https://phabricator.wikimedia.org/T288175 (Ladsgroup) >>! In T288175#7289164, @Ladsgroup wrote: > Maybe after a while we should delete the old logs, I can put a...
[07:20:18] !log start of clean up of autreview logs of ruwiki, deleting 3.5M rows (T285608)
[07:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:22] T285608: Stop logging and clean up auto review logs - https://phabricator.wikimedia.org/T285608
[07:22:55] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6008.drmrs.wmnet with OS buster
[07:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:07] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6008.drmrs.wmnet with OS buster completed: - cp6008 (**WARN**)...
[07:28:15] !log `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user
[07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:05] !log `apt-get clean` on an-tool1005 to free space in the root partition
[07:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:02] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6010.drmrs.wmnet with OS buster
[07:31:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17746 and previous config saved to /var/cache/conftool/dbconfig/20211117-073102-root.json
[07:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:11] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster
[07:34:25] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6009.drmrs.wmnet with OS buster
[07:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:34] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6009.drmrs.wmnet with OS buster completed: - cp6009 (**WARN**)...
[07:39:32] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:41:44] SRE, Analytics, LDAP-Access-Requests: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (Peachey88) @CGlenn I would recommend filing separate requests for each team member using the template from here: https://phabricat...
[07:44:32] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6011.drmrs.wmnet with OS buster
[07:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:41] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster
[07:45:35] (CR) Gehel: "LGTM in principle. See the minor comment inline." [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) (owner: Ryan Kemper)
[07:46:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17747 and previous config saved to /var/cache/conftool/dbconfig/20211117-074606-root.json
[07:46:07] (PS9) Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905)
[07:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:03] (CR) Elukey: [C: +2] Move coal, navtiming and statsv to the new CA bundle [puppet] - https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[07:49:21] !log restart coal, navtiming, statsv (refreshed by puppet) after https://gerrit.wikimedia.org/r/737970
[07:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:55] Krinkle: --^ o/ as FYI the above have been restarted, all good afaics, let me know if anything looks out of the ordinary
[07:55:00] (PS1) Elukey: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905)
[08:01:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17748 and previous config saved to /var/cache/conftool/dbconfig/20211117-080110-root.json
[08:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:35] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6010.drmrs.wmnet with OS buster
[08:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:46] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6010.drmrs.wmnet with OS buster completed: - cp6010 (**WARN**)...
[08:14:52] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6012.drmrs.wmnet with OS buster
[08:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:01] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster
[08:16:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 40%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17749 and previous config saved to /var/cache/conftool/dbconfig/20211117-081613-root.json
[08:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:23] (PS5) Giuseppe Lavagetto: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: BBlack)
[08:21:59] (CR) Muehlenhoff: [C: +1] admin: add Julia Kieserman to ldap_only section [puppet] - https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) (owner: Dzahn)
[08:24:19] (CR) Nikerabbit: [C: +1] Enable more languages for Section Translation in testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223) (owner: KartikMistry)
[08:24:29] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6011.drmrs.wmnet with OS buster
[08:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:39] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**)...
[08:30:19] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6013.drmrs.wmnet with OS buster
[08:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:28] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster
[08:30:48] (CR) Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/737858 (owner: Thiemo Kreuz (WMDE))
[08:31:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17750 and previous config saved to /var/cache/conftool/dbconfig/20211117-083117-root.json
[08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:34] (CR) Muehlenhoff: "One comment inline. And what Andrew said; let's hold merging this until next week, until yesterday's LDAP changes has settled a but." [dns] - https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: Majavah)
[08:34:15] (PS6) JMeybohm: Auto add helm chart repositories [deployment-charts] - https://gerrit.wikimedia.org/r/739122
[08:34:17] (PS8) JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857
[08:34:22] (PS1) Samwilson: Enable disambiguator notifications on 6 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/739467 (https://phabricator.wikimedia.org/T293319)
[08:40:17] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: Jbond)
[08:42:52] (PS4) JMeybohm: Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737884
[08:42:54] (PS4) JMeybohm: Install update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737885
[08:45:11] (PS1) Muehlenhoff: Enable ganeti216 for test cluster [puppet] - https://gerrit.wikimedia.org/r/739469
[08:46:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17751 and previous config saved to /var/cache/conftool/dbconfig/20211117-084621-root.json
[08:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:39] (CR) David Caro: acme_chief: convert cron to restart service to timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:46:50] (CR) JMeybohm: [V: +2 C: +2] Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737884 (owner: JMeybohm)
[08:46:57] (CR) JMeybohm: [V: +2 C: +2] Install update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737885 (owner: JMeybohm)
[08:47:21] (CR) David Caro: "btw. thanks for the patch 👍" [puppet] - https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn)
[08:50:30] (CR) Giuseppe Lavagetto: Auto add helm chart repositories (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/739122 (owner: JMeybohm)
[08:51:37] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (ayounsi) a:RobH `mgmt` ports to the `mgmt` switch please :) Once we have this and console, we can check and upgrade them.
[08:54:16] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6012.drmrs.wmnet with OS buster
[08:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:25] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**)...
[08:56:23] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6014.drmrs.wmnet with OS buster
[08:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:33] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster
[09:00:24] (CR) Muehlenhoff: [C: +2] Enable ganeti216 for test cluster [puppet] - https://gerrit.wikimedia.org/r/739469 (owner: Muehlenhoff)
[09:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P17752 and previous config saved to /var/cache/conftool/dbconfig/20211117-090124-root.json
[09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:13] !log installing ffmpeg security updates on stretch
[09:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:11] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Remove PHP 7.3 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/739006 (owner: Legoktm)
[09:04:32] (PS7) JMeybohm: Auto add helm chart repositories [deployment-charts] - https://gerrit.wikimedia.org/r/739122
[09:04:34] (PS9) JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857
[09:06:12] (PS1) Jcrespo: dbbackups: Further reorganize backup location to optimize for total latency [puppet] - https://gerrit.wikimedia.org/r/739471 (https://phabricator.wikimedia.org/T138562)
[09:11:09] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6013.drmrs.wmnet with OS buster
[09:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:14] (PS6) Giuseppe Lavagetto: php-fpm: Add settings to control debuggability [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 (owner: Ahmon Dancy)
[09:11:17] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**)...
[09:13:36] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6015.drmrs.wmnet with OS buster
[09:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:45] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster
[09:16:42] (CR) Giuseppe Lavagetto: [V: +2 C: +2] php-fpm: Add settings to control debuggability [docker-images/production-images] - https://gerrit.wikimedia.org/r/732737 (owner: Ahmon Dancy)
[09:17:53] (PS2) Jcrespo: dbbackups: Further reorganize backup location to optimize for total latency [puppet] - https://gerrit.wikimedia.org/r/739471 (https://phabricator.wikimedia.org/T138562)
[09:19:24] <_joe_> !log removing php 7.3 images from docker-registry.wikimedia.org
[09:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:57] (PS1) Giuseppe Lavagetto: correctly bump version in changelog [docker-images/production-images] - https://gerrit.wikimedia.org/r/739472
[09:23:59] (CR) JMeybohm: [V: +2 C: +2] Auto add helm chart repositories (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/739122 (owner: JMeybohm)
[09:24:05] (CR) JMeybohm: [V: +2 C: +2] Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[09:24:14] (CR) Giuseppe Lavagetto: [V: +2 C: +2] correctly bump version in changelog [docker-images/production-images] - https://gerrit.wikimedia.org/r/739472 (owner: Giuseppe Lavagetto)
[09:28:09] (CR) Ayounsi: [C: +2] "Thanks!" [homer/public] - https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[09:28:29] (Merged) jenkins-bot: Auto add helm chart repositories [deployment-charts] - https://gerrit.wikimedia.org/r/739122 (owner: JMeybohm)
[09:28:31] (CR) jerkins-bot: [V: -1] Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[09:28:49] (Merged) jenkins-bot: Add drmrs switches to Homer [homer/public] - https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: Ayounsi)
[09:30:06] (PS10) JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857
[09:31:31] (PS6) Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[09:31:33] (PS1) Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752)
[09:32:50] (CR) jerkins-bot: [V: -1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:33:07] SRE, SRE-Access-Requests, Wikibase Release Strategy, Wikidata, wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (Rosalie_WMDE)
[09:34:48] (Abandoned) Arturo Borrero Gonzalez: cloud: ceph: libvirt: migrate to new ceph auth abstraction [puppet] - https://gerrit.wikimedia.org/r/739235 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:34:56] SRE, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (Jelto)
[09:35:28] SRE, Tracking-Neverending: Cronspam from acmechief-test1001 - https://phabricator.wikimedia.org/T295770 (Jelto) Resolved→Open @Vgutierrez fyi: there are some more mails from `root@acmechief-test1001`.
[09:35:49] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6014.drmrs.wmnet with OS buster
[09:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:58] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**)...
[09:36:20] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (Jelto) Feedback from Carol via mail: > Carol Dunn 2:13 AM (8 hours ago) to me > Approved > Sent from my iPhone
[09:36:39] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (Jelto)
[09:38:02] PROBLEM - DPKG on ganeti-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:39:36] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6016.drmrs.wmnet with OS buster
[09:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:46] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster
[09:42:09] (CR) ZPapierski: [C: +1] query_service: Generalize prometheus config [puppet] - https://gerrit.wikimedia.org/r/737484 (https://phabricator.wikimedia.org/T280008) (owner: Ebernhardson)
[09:43:17] (PS7) Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[09:43:19] (PS2) Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752)
[09:44:11] (CR) jerkins-bot: [V: -1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:44:16] (PS1) Elukey: Deploy the wmf_trusted_cas.jks bundle where Gobblin runs [puppet] - https://gerrit.wikimedia.org/r/739476 (https://phabricator.wikimedia.org/T291905)
[09:44:57] (CR) jerkins-bot: [V: -1] openstack: nova: factorize libvirt secrets management [puppet] - https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:45:56] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32462/console" [puppet] - https://gerrit.wikimedia.org/r/739476 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[09:45:59] (PS5) JMeybohm: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:47:34] (PS8) Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752)
[09:47:36] (PS3) Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752)
[09:48:25] !log running "gnt-cluster renew-crypto --new-cluster-certificate" on ganeti test cluster
[09:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:30] (CR) jerkins-bot: [V: -1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:49:17] (CR) jerkins-bot: [V: -1] openstack: nova: factorize libvirt secrets management [puppet] - https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: Arturo Borrero Gonzalez)
[09:51:38] SRE, SRE-Access-Requests, WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (Jelto) p:Triage→Medium
[09:52:07]
10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) p:05Triage→03Medium [09:53:45] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6015.drmrs.wmnet with OS buster [09:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:54] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**)... [09:55:37] (03CR) 10David Caro: "Can you run a PCC too?" [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:56:42] (03PS9) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [09:56:44] (03PS4) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [09:56:46] (03PS1) 10Elukey: presto: move truststore to the new wmf internal CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/739477 [09:57:41] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [09:58:31] (03PS3) 10Jcrespo: dbbackups: Further reorganize backup location to optimize for total latency [puppet] - 10https://gerrit.wikimedia.org/r/739471 (https://phabricator.wikimedia.org/T138562) [09:58:57] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32464/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/739477 (owner: 10Elukey) [09:59:50] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [09:59:50] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [09:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:36] !log running "gnt-cluster upgrade --to 2.16" on ganeti test cluster [10:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:23] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [10:01:23] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:46] (03CR) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:02:10] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Further reorganize backup location to optimize for total latency [puppet] - 10https://gerrit.wikimedia.org/r/739471 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:02:12] (03CR) 10Hashar: "Compiler https://puppet-compiler.wmflabs.org/compiler1003/1079/contint2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/738370 (https://phabricator.wikimedia.org/T187897) (owner: 10Hashar) [10:02:14] (03PS2) 10Elukey: presto: move truststore to the new wmf internal CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/739477 [10:03:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32466/console" [puppet] - 10https://gerrit.wikimedia.org/r/739477 (owner: 10Elukey) [10:05:18] (03PS1) 10Cathal Mooney: Depool ulsfo to allow for safe reconfig of CR routers there [dns] - 10https://gerrit.wikimedia.org/r/739479 (https://phabricator.wikimedia.org/T295672) [10:05:41] (03CR) 10David Caro: "Just noting that this requires the secret keydata to be already in the private repo for both codfw1dev and eqiad1" [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:07:04] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) This will cause a hard downtime for 6 servers (rack [[ https://netbox.wikimedia.org/dcim/racks/57/ | B7 ]]), for up to 1h, but most likely less: (1) thanos-be2002... [10:08:04] (03CR) 10David Caro: openstack: nova: factorize libvirt secrets management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:08:09] (03CR) 10Ayounsi: [C: 03+1] Depool ulsfo to allow for safe reconfig of CR routers there [dns] - 10https://gerrit.wikimedia.org/r/739479 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [10:08:11] (03PS1) 10Jelto: admin: add ssh key for saisuman and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/739481 (https://phabricator.wikimedia.org/T295552) [10:10:17] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) 05Open→03In progress p:05Triage→03Medium [10:10:40] (03CR) 10Cathal Mooney: [C: 03+2] Depool ulsfo to allow for safe reconfig of CR routers there [dns] - 10https://gerrit.wikimedia.org/r/739479 
(https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [10:12:22] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) pinged @thcipriani in irc to take a look here to progress with the access request [10:13:20] Heads up - I am de-pooling ulsfo in DNS to drain it of traffic before rolling out some changes to CR routers there (T295672). [10:13:21] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [10:14:08] !log De-pool ulsfo in DNS to allow safe reconfiguration / test of changes to CR routers iBGP (T295672) [10:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:28] ack topranks [10:15:45] (03CR) 10Btullis: alertmanager: send releng alerts to both irc and mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:18:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2002.codfw.wmnet [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:58] (03PS2) 10Hashar: alertmanager: send releng alerts to both irc and mail [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) [10:18:59] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6016.drmrs.wmnet with OS buster [10:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:09] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**)... [10:19:11] (03CR) 10JMeybohm: "I've rebased against master with the updated CI. 
As I suspected, this change now looks way more invasive" [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [10:19:43] (03PS10) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [10:20:27] (03CR) 10Hashar: alertmanager: send releng alerts to both irc and mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:22:25] topranks: don't forget to run authdns-update :) [10:22:53] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10LSobanski) Adding @MatthewVernon for the Swift hosts. [10:22:54] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32467/console" [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:23:04] yep... already done and hoping I've not broken everything :D [10:23:08] thanks. [10:24:18] excellent! 
[10:25:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10LSobanski) [10:26:07] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739481 (https://phabricator.wikimedia.org/T295552) (owner: 10Jelto) [10:27:06] (03CR) 10David Caro: openstack: nova: factorize libvirt secrets management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:28:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2002.codfw.wmnet [10:28:12] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Jelto) p:05Triage→03Medium [10:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:16] (03PS5) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:29:58] (03CR) 10David Caro: [C: 03+1] "pcc looks ok to me too" [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:30:00] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:31:32] (03PS6) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:31:49] !log A:cp disable-puppet to merge and test https://gerrit.wikimedia.org/r/c/operations/puppet/+/738949/ T293879 [10:31:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:53] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 [10:32:01] (03PS1) 10Muehlenhoff: Add MAC address for testvm2002 [puppet] - 10https://gerrit.wikimedia.org/r/739485 [10:32:46] (03CR) 10Btullis: alertmanager: send releng alerts to both irc and mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:33:04] (03CR) 10Ema: [C: 03+2] varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:33:16] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:33:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Jelto) Thanks for the request and creating the change. @Ottomata or @odimitrijevic could you also approve this access request to `analytics-p... 
[10:35:10] (03PS1) 10JMeybohm: Fix distribution in debian changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/739506 [10:35:29] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Fix distribution in debian changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/739506 (owner: 10JMeybohm) [10:36:59] (03CR) 10Muehlenhoff: [C: 03+2] Add MAC address for testvm2002 [puppet] - 10https://gerrit.wikimedia.org/r/739485 (owner: 10Muehlenhoff) [10:37:39] (03CR) 10Jelto: [C: 03+2] admin: add ssh key for saisuman and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/739481 (https://phabricator.wikimedia.org/T295552) (owner: 10Jelto) [10:37:45] !log imported wmf-certificates 0~20211110-1 to stretch-wikimedia,buster-wikimedia,bullseye-wikimedia [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:51] (03PS3) 10Ema: prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) [10:39:11] (03PS11) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [10:39:13] (03PS7) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:40:23] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:41:16] !log A:cp re-enable puppet after testing https://gerrit.wikimedia.org/r/c/operations/puppet/+/738949/ T293879 [10:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:20] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - 
https://phabricator.wikimedia.org/T293879 [10:41:42] (03PS8) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:42:14] !log replaced all references to deploy1001 with deploy1002 in all .git/DEPLOY_HEAD directories on deploy1002:/srv/deployment [10:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:42] 10SRE, 10Scap, 10Release-Engineering-Team (Priority Backlog 📥): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10hnowlan) >>! In T197470#7507203, @dancy wrote: >>>! In T197470#7506161, @hnowlan wrote: >> For the immediate term if t... [10:43:39] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:45:04] (03CR) 10Vgutierrez: [C: 03+1] prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:45:16] !log Commencing manual config on cr3-ulsfo and cr4-ulsfo (site depooled) to reconfigure iBGP (T295672) [10:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:19] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [10:45:49] !log restarting blazegraph on wdqs1013 (jvm stuck) [10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:20] (03CR) 10Hnowlan: [C: 03+2] maps: Make silent cURL requests on tile invalidation [puppet] - 10https://gerrit.wikimedia.org/r/739241 (owner: 10Jgiannelos) [10:46:36] (03CR) 10Vgutierrez: [C: 03+1] varnish: remove internal mtail scripts from default instance [puppet] - 10https://gerrit.wikimedia.org/r/739229 
(https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:46:44] (03PS12) 10Arturo Borrero Gonzalez: openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) [10:46:46] (03PS9) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:47:09] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:48:17] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) If we look at another host that is not in the list, but was purchased and installed at the same time as an-worker110[45] (un... [10:49:02] (03PS4) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [10:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17753 and previous config saved to /var/cache/conftool/dbconfig/20211117-105120-marostegui.json [10:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:25] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [10:52:09] (03PS9) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [10:53:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10BTullis) [10:53:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC: 
https://puppet-compiler.wmflabs.org/compiler1001/32469/" [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:53:36] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10BTullis) 05Open→03Resolved Committed. The results are here: https://netbox.wikimedia.org/extras/scripts/results/1924060/ Results... [10:53:53] (03CR) 10Jbond: [C: 03+1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [10:53:58] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10MatthewVernon) I don't think so, no - the frontends will not route requests to down servers (at least in theory!); we'll be more vulnerable to failur... 
[10:54:48] (03CR) 10Hashar: alertmanager: send releng alerts to both irc and mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:56:01] (03PS7) 10Jbond: apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) [10:56:17] (03CR) 10Jbond: [C: 03+2] apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [10:56:19] (03CR) 10Jbond: [V: 03+2 C: 03+2] apereo_cas: add cas_u2f script [puppet] - 10https://gerrit.wikimedia.org/r/739279 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [10:56:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova: factorize libvirt secrets management [puppet] - 10https://gerrit.wikimedia.org/r/739223 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [10:56:38] (03CR) 10Ema: [C: 03+2] prometheus:ops: add varnishmtail-internal jobs [puppet] - 10https://gerrit.wikimedia.org/r/739227 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [10:58:37] (03CR) 10Btullis: [C: 03+2] alertmanager: send releng alerts to both irc and mail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738381 (https://phabricator.wikimedia.org/T292284) (owner: 10Hashar) [10:58:44] (03PS10) 10Arturo Borrero Gonzalez: cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) [10:59:17] (03CR) 10jerkins-bot: [V: 04-1] admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [11:03:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "PCC makes more sense after the previous patch has been merged:" [puppet] - 10https://gerrit.wikimedia.org/r/739474 
(https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:05:49] (03PS6) 10Giuseppe Lavagetto: sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [11:06:11] 10SRE, 10SRE-tools, 10Analytics, 10Data-Engineering, and 3 others: an-worker hosts: Netbox - PuppetDB interfaces discrepancies - https://phabricator.wikimedia.org/T295763 (10Volans) Thanks a lot! [11:08:37] (03PS5) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [11:08:49] (03PS5) 10JMeybohm: admin_ng: Add helmfile for cert-manager and cfssl-issuer [deployment-charts] - 10https://gerrit.wikimedia.org/r/737939 (https://phabricator.wikimedia.org/T294560) [11:08:51] (03PS7) 10JMeybohm: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) [11:09:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sslcert::ca_deselect_dstx3 for envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [11:09:33] (03CR) 10David Caro: [C: 03+1] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:09:36] !log installing testvm2002 [11:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: ceph: client: rbd_libvirt: enable ceph::auth::conf [puppet] - 10https://gerrit.wikimedia.org/r/739474 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:10:41] (03CR) 10Muehlenhoff: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:11:07] (03CR) 10Giuseppe Lavagetto: "I merged this patch as we need it for new appservers etc. But I'm up to fix the logical issues in dependencies going forward." [puppet] - 10https://gerrit.wikimedia.org/r/725331 (https://phabricator.wikimedia.org/T292291) (owner: 10BBlack) [11:11:40] (03CR) 10JMeybohm: "Moved Rakefile changes and admin_ng fixtures to previous commit to have that logic grouped together." [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [11:12:21] (03PS16) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [11:15:01] (03PS2) 10Ema: varnish: remove internal mtail scripts from default instance [puppet] - 10https://gerrit.wikimedia.org/r/739229 (https://phabricator.wikimedia.org/T293879) [11:15:06] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739229 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [11:15:16] (03CR) 10Volans: [C: 04-1] "Couple of typos/concerns inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:20:35] (03PS1) 10Arturo Borrero Gonzalez: cloud: don't deploy cinder keyring in cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/739516 (https://phabricator.wikimedia.org/T293752) [11:22:10] (03CR) 10David Caro: [C: 03+1] cloud: don't deploy cinder keyring in cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/739516 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:22:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: don't deploy cinder keyring in cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/739516 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:23:13] PROBLEM - Widespread 
puppet agent failures on alert1001 is CRITICAL: 0.01016 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:25:58] all cloud virt nodes --^ [11:26:09] arturo: --^ [11:26:12] :) [11:26:29] /go moritzm [11:27:49] (03PS4) 10Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [11:28:06] (03CR) 10Jbond: "updated" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [11:28:48] elukey: thanks! yeah... we're on top of it already! [11:29:39] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005643 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:29:54] arturo: I was sure about it, just wanted to ping you as FYI :) thanks! [11:30:18] elukey: thanks, appreciated :-) should be solved now hopefully [11:32:32] (03CR) 10Santhosh: [C: 03+1] Enable more languages for Section Translation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry) [11:34:52] (03PS6) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [11:35:05] (03CR) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:37:16] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:38:29] (03CR) 10Muehlenhoff: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 
(https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:39:00] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10BTullis) I don't believe that we need to do any prep or depooling work for furud.codfw.wmnet We can downtime it in Icinga, but I think that's the lim... [11:39:14] (03PS7) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [11:40:35] (03PS10) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [11:41:37] (03CR) 10Jbond: [C: 03+2] P::configmaster: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/736499 (owner: 10Majavah) [11:43:26] (03PS8) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [11:47:26] (03CR) 10Volans: [C: 03+1] "LGTM, one possible leftover comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [11:52:46] (03PS11) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T1200). [12:00:04] Eigyan, kart_, and samwilson: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:09] o/
[12:00:22] hullo
[12:00:30] o/
[12:00:38] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851)
[12:00:46] \0
[12:00:49] Lucas_WMDE: want to deploy, or should I? :)
[12:00:57] either way is fine by me :)
[12:01:04] I have no specific plans for this window
[12:01:27] I can do it if you want
[12:01:37] That'd be cool Lucas_WMDE :)
[12:01:41] ok!
[12:01:51] then let’s start with kart_
[12:02:00] (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::php: support kubernetes in php-fatal-error.php [puppet] - 10https://gerrit.wikimedia.org/r/739520 (https://phabricator.wikimedia.org/T288851)
[12:03:34] (03PS2) 10Lucas Werkmeister (WMDE): Enable more languages for Section Translation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry)
[12:03:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable more languages for Section Translation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry)
[12:04:34] (03Merged) 10jenkins-bot: Enable more languages for Section Translation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739391 (https://phabricator.wikimedia.org/T294223) (owner: 10KartikMistry)
[12:05:00] kart_: the change is on mwdebug1001, can you test it?
[12:05:10] Yes.
[12:05:24] * Lucas_WMDE looks at the other change in the meantime
[12:06:56] Lucas_WMDE: looks good! Please deploy.
[12:07:02] ok!
[12:07:39] (03PS17) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914
[12:08:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[12:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:52] (03PS9) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) [12:08:56] (03PS12) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [12:09:05] (03CR) 10Jbond: cookbook sre.idm.u2f: add cookbook to enable/disable u2f (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [12:09:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: radosgw: migrate to new ceph auth abstraction [puppet] - 10https://gerrit.wikimedia.org/r/739522 (https://phabricator.wikimedia.org/T293752) [12:09:37] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:739391|Enable more languages for Section Translation in testwiki (T294223)]] (duration: 01m 52s) [12:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:40] T294223: Enable more languages for Section Translation in test wiki - https://phabricator.wikimedia.org/T294223 [12:09:51] (03CR) 10Jbond: hiera: create script endpoint for exporting hiera data (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [12:10:58] samwilson: the task description says that “Translations should also be in place”, but according to https://codesearch.wmcloud.org/search/?q=%22disambiguator-(notification-(question%7Csummary)%7Creview-link)%22 not all six relevant languages have all of these messages translated yet; is that okay? 
[12:11:18] (03PS1) 10Arturo Borrero Gonzalez: hieradata: codfw: ceph: add dummy keydata for radosgw [labs/private] - 10https://gerrit.wikimedia.org/r/739523 (https://phabricator.wikimedia.org/T293752) [12:11:19] meh, my IRC client broke that link, idk if it’ll look better on your end [12:11:19] !log failover ganeti master in test cluster to ganeti-test2003 [12:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:36] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [12:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:44] Thanks Lucas_WMDE ! [12:12:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:31] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [12:12:37] Lucas_WMDE: yep, that's okay [12:12:42] ok [12:12:47] (03PS2) 10Lucas Werkmeister (WMDE): Enable disambiguator notifications on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739467 (https://phabricator.wikimedia.org/T293319) (owner: 10Samwilson) [12:13:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable disambiguator notifications on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739467 (https://phabricator.wikimedia.org/T293319) (owner: 10Samwilson) [12:13:31] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: codfw: ceph: add dummy keydata for radosgw [labs/private] - 10https://gerrit.wikimedia.org/r/739523 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:14:11] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32472/console" [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [12:14:20] (03CR) 10Btullis: "I'm happy with the change, but after your explanation yesterday I looked for the effect that you mentioned in grafana and I couldn't find " [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 (owner: 10Elukey) [12:14:29] (03Merged) 10jenkins-bot: Enable disambiguator notifications on 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739467 (https://phabricator.wikimedia.org/T293319) (owner: 10Samwilson) [12:15:33] (03PS1) 10Cathal Mooney: Revert "Depool ulsfo to allow for safe reconfig of CR routers there" [dns] - 10https://gerrit.wikimedia.org/r/739490 [12:15:37] samwilson: that change is on mwdebug1001 now, please test :) [12:15:56] thanks. testing now. [12:16:08] (03PS18) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:16:43] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Depool ulsfo to allow for safe reconfig of CR routers there" [dns] - 10https://gerrit.wikimedia.org/r/739490 (owner: 10Cathal Mooney) [12:17:14] PROBLEM - ganeti-wconfd running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [12:17:18] !log Re-pooling ulsfo after completing routing changes on cr3-ulsfo and cr4-ulsfo (T295672) [12:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:21] T295672: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 [12:18:29] I'm testing that now as well [12:18:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32473/console" [puppet] - 
10https://gerrit.wikimedia.org/r/739522 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:20:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:22:03] (03PS1) 10Ladsgroup: export: Ignore rev_page_id index [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739491 (https://phabricator.wikimedia.org/T285149) [12:22:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [12:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:12] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:22:15] (03PS13) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [12:22:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] Lucas_WMDE: you're good to go, with the disambig change. thanks! [12:23:09] ok! 
[12:24:26] (03PS19) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:24:31] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:739467|Enable disambiguator notifications on 6 Wikipedias (T293319)]] (duration: 01m 04s) [12:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:35] T293319: Rollout plan for disambiguation notifications (wgDisambiguatorNotifications) - https://phabricator.wikimedia.org/T293319 [12:25:22] no sign of Essex/Eigyan yet afaict [12:26:17] (03PS20) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:26:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:03] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: move cluster:user relation from 1:1 relation to a 1:many [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [12:28:08] (03PS21) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:29:54] (03CR) 10Jbond: "ready for review" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [12:30:26] (03PS22) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 [12:31:08] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:34:12] (03PS23) 10Jbond: hiera: create script endpoint for exporting hiera data [software/netbox-extras] 
- 10https://gerrit.wikimedia.org/r/737914 [12:34:47] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Marostegui) a:03Marostegui I will take care of this [12:36:12] (03CR) 10ArielGlenn: [C: 03+1] "Approved by me, I guess this will go in a backport window." [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739491 (https://phabricator.wikimedia.org/T285149) (owner: 10Ladsgroup) [12:36:40] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:42:44] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10Jelto) p:05Triage→03Medium Would it be possible to use the official template for access requests? You can find... [12:50:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) 05In progress→03Resolved a:03Jelto @SCherukuwada you should have access now to `analytics-privatedata-users`. I'm closing this task. Feel free to re-open i... [12:50:16] eigyan20: are you the Essex/EIgyan in the deployment calendar? [12:50:22] for the QuickSurveys patch? 
[12:50:36] Yes [12:50:50] alright [12:50:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:50:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [12:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:41] wondering why my eigyan handle isn't showing up I will research that [12:51:49] those two page IDs in the patch look like they’re specific to beta enwiki [12:52:01] is the survey limited to that wiki or will it show up on other beta wikis as well? [12:52:17] I don’t see a reference to enwiki in that patch, but I’m not very familiar with quicksurveys in general [12:52:19] it is limited to that wiki and locked down via page ID [12:53:44] this patch is intended for the Beta Cluster(labs) only for now [12:54:07] ok but what if the same page ID exists on another beta wiki? [12:54:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [12:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 38 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:55:45] (I’m currently trying to figure out if there are any other such wikis) [12:57:24] simplewiki has the second page ID: https://simple.wikipedia.beta.wmflabs.org/w/index.php?curid=265895 [12:59:25] (03CR) 10Lucas Werkmeister (WMDE): "If the survey is limited to two page IDs, shouldn’t it also be limited to a single wiki? 
The second page ID exists on another wiki: https:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [12:59:27] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup100X-dev: add insetup role [puppet] - 10https://gerrit.wikimedia.org/r/739535 (https://phabricator.wikimedia.org/T295584) [12:59:30] left a comment on the change [12:59:37] Will this patch then affect that page ID on simplewiki; my first page of this kind so I don't know unfortunately [12:59:48] I assume it would, yes [13:00:25] the backport window just ended anyways, I hope it’s okay to defer this until the next window? [13:01:09] sure I will get the page IDs sorted and reschedule. Thanks [13:01:21] ok, good luck! [13:02:09] (on beta wikidatawiki the page_id counter also exceeds that number – it’s somewhere above 800K in fact – but those two specific page IDs were apparently deleted) [13:02:23] (they can still be found in the archive table) [13:02:32] !log UTC morning backport+config window done [13:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:27] (03PS2) 10Arturo Borrero Gonzalez: cloudbackup100X-dev: add insetup role [puppet] - 10https://gerrit.wikimedia.org/r/739535 (https://phabricator.wikimedia.org/T295584) [13:10:45] !log aborrero@cumin1001 START - Cookbook sre.ganeti.makevm for new host cloudbackup1001-dev.eqiad.wmnet [13:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup100X-dev: add insetup role [puppet] - 10https://gerrit.wikimedia.org/r/739535 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [13:20:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudbackup1001-dev.eqiad.wmnet [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:59] 10SRE, 10vm-requests, 
10Patch-For-Review, 10cloud-services-team (Kanban): eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10aborrero) Created 1 VM with: ` aborrero@cumin1001:~ $ sudo cookbook sre.ganeti.makevm eqiad_B cloudbackup1001-dev --vcpus 2 --memory 4 ` no errors in t... [13:24:35] (03PS20) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [13:24:43] (03PS11) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [13:25:48] (03PS21) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [13:26:31] (03PS12) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [13:27:38] 10SRE, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Brooke Camarda & Olga Spingou (superset, turnilo, hue) - https://phabricator.wikimedia.org/T295828 (10Aklapper) 05Open→03Invalid Basically what Peachey88 wrote - please split this task, and use the template link to fill out the... 
[13:28:32] (03PS6) 10David Caro: Added cookbook to create an nfs server [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [13:32:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/737914 (owner: 10Jbond) [13:35:12] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [13:37:32] (03PS1) 10Jgiannelos: tile-pregeneration: Silent cURL with faster timeout [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/739538 [13:44:04] (03CR) 10Jbond: [C: 03+2] cookbook sre.idm.u2f: add cookbook to enable/disable u2f (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [13:44:12] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Gehel) The elasticsearch cluster should be able to cope with loosing 2 nodes with no issues. Thanks for flagging this, and please ping @RKemper and m... 
[13:47:30] (03Merged) 10jenkins-bot: cookbook sre.idm.u2f: add cookbook to enable/disable u2f [cookbooks] - 10https://gerrit.wikimedia.org/r/739276 (https://phabricator.wikimedia.org/T295579) (owner: 10Jbond) [13:48:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges from s5 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17754 and previous config saved to /var/cache/conftool/dbconfig/20211117-134835-marostegui.json [13:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:40] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [13:49:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Change weights on s5 special slaves in eqiad T263127', diff saved to https://phabricator.wikimedia.org/P17755 and previous config saved to /var/cache/conftool/dbconfig/20211117-134942-marostegui.json [13:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32474/console" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:02:32] (03CR) 10Elukey: sre.druid.roll-restart-workers: restart Druid exporter (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/739240 (owner: 10Elukey) [14:05:56] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [14:13:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10MatthewVernon) 
@Papaul I'm trying to `xfs_repair` the filesystem, which is a lengthy process, but I'm seeing medium errors in the kernel log aga... [14:20:22] PROBLEM - ganeti-mond running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [14:20:32] PROBLEM - ganeti-noded running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [14:20:32] PROBLEM - ganeti-confd running on ganeti-test2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [14:22:08] ^ monitoring glitch from update test, I'm silencing those [14:22:34] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Ganeti update tests [14:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti-test[2001-2003].codfw.wmnet with reason: Ganeti update tests [14:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:32] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10sbassett) [14:41:40] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10sbassett) >>! In T295790#7509658, @Jelto wrote: > Would it be possible to use the official template for access reque... 
[14:46:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) Sorry @Jclark-ctr some other discussions started with Service Ops about networking, we still need to reach a quorum about final settings (vlan,... [14:47:59] !log installing perl bugfix updates from Bullseye point release [14:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:13] (03PS4) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) [15:12:11] (03CR) 10Dzahn: [V: 03+1 C: 03+2] admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) (owner: 10Dzahn) [15:14:46] (03PS5) 10Giuseppe Lavagetto: mediawiki: add handling of php-fpm logs via rsyslogd [deployment-charts] - 10https://gerrit.wikimedia.org/r/734692 (https://phabricator.wikimedia.org/T288851) [15:20:50] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10lmata) @ayounsi after a chat with the team we think we should be fine, we will monitor and be available should something happen. 
[15:20:54] (03PS22) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [15:21:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32475/console" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:23:01] (03PS23) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [15:23:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32476/console" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:24:12] (03PS6) 10Giuseppe Lavagetto: add miscweb to LVS [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:25:39] (03CR) 10JMeybohm: [C: 04-1] istio: Fix main config, add basic NetworkPolicy for staging/ml-serve (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [15:26:22] (03PS1) 10KartikMistry: Enable Tamil (ta) Section Translation in test wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739550 (https://phabricator.wikimedia.org/T294223) [15:26:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] add miscweb to LVS [puppet] - 10https://gerrit.wikimedia.org/r/694625 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:27:03] :) [15:27:32] <_joe_> merging now [15:27:35] (03PS3) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) [15:27:52] (03PS1) 10JMeybohm: admin_ng/common: Add a warning to allowCriticalPods switch [deployment-charts] - 
10https://gerrit.wikimedia.org/r/739551 [15:28:04] (03CR) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [15:28:10] <_joe_> mutante: running on a random k8s worker [15:30:02] (03CR) 10David Caro: "@Arturo, remember to re-fetch this PR before continuing to work on it, I rebased it on top of master." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [15:30:05] ACK, preparing cumin run on A:kuberetes-worker with -b 30 [15:30:17] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Joe) [15:30:21] (03CR) 10David Caro: Added cookbook to create an nfs server (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/736915 (owner: 10Andrew Bogott) [15:31:11] waiting for OK to run it [15:31:16] (03PS1) 10BBlack: drmrs: define dual ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/739553 (https://phabricator.wikimedia.org/T282787) [15:31:31] <_joe_> mutante: go on [15:31:41] alright, running [15:31:52] (03PS2) 10Giuseppe Lavagetto: service/miscweb: switch state from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/694628 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:31:57] batch 30 with 34 hosts :) [15:32:03] <_joe_> eheh [15:32:14] <_joe_> I wasn't sure how much higher we were with the node count by now [15:32:28] *nod* [15:32:29] <_joe_> and right, this is only the eqiad/codfw main clusters [15:33:09] everything with the kubernetes::worker class, yes [15:33:24] 100% done [15:33:26] no fails [15:33:31] (03CR) 10AOkoth: [C: 03+1] role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:33:50] it added the LVS_SERVICE_IP 
[15:35:00] !log drain ulsfo-codfw link [15:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:32] <_joe_> ok [15:35:40] <_joe_> let's go with the next patch then [15:35:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service/miscweb: switch state from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/694628 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [15:35:59] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739553 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [15:36:01] <_joe_> this will bring us to lvs_setup [15:36:06] service_setup to lvs_setup, ack [15:36:08] <_joe_> and we will have to restart the low-traffic pybals [15:36:47] <_joe_> I'll start from lvs2010 [15:36:54] <_joe_> first run puppet, then restart pybal [15:37:09] ah, so I was wondering if we do this via cumin or not, thanks [15:37:10] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10ssastry) Ping! This would also make accessing test results less cumbersome without needing to set up ssh tunnels. 
[15:37:15] <_joe_> no never [15:37:18] ack [15:37:22] <_joe_> pybal needs to be restarted in order [15:37:26] <_joe_> first the two backups [15:37:38] <_joe_> you can actually start on lvs1016 if you want by running puppet [15:38:03] <_joe_> so on an lvs server, if you run [15:38:05] <_joe_> curl -s localhost:9090/metrics | grep -i pybal_bgp_session_established [15:38:14] <_joe_> it will tell you which bgp sessions are established [15:38:18] <_joe_> by pybal [15:38:21] <_joe_> once you restart it [15:38:31] <_joe_> you need to wait for those metrics to go back to what they were [15:38:43] (03PS1) 10Bartosz Dziewoński: Make reply tool available as opt-out on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739556 (https://phabricator.wikimedia.org/T295838) [15:38:49] <_joe_> (so "1.0") [15:38:58] <_joe_> before you can restart the other pybal safely [15:39:13] <_joe_> !log restarting pybal on lvs2010 [15:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:30] ok! this is the part I needed. 
running puppet on lvs1016 [15:39:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:39:58] (03CR) 10Ema: [C: 03+1] cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:40:09] _joe_: puppet run finished and that value is 1.0 right away [15:40:27] <_joe_> mutante: because pybal is not restarted by puppet [15:40:38] <_joe_> exactly because it needs coordination [15:40:55] I understood that as waiting until it's 1.0 before restarting, ok [15:41:17] <_joe_> no before restarting the next lvs [15:41:23] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) [15:41:40] ok, so I can restart on 1016 now [15:41:43] (03PS1) 10Vgutierrez: site: Reimage cp2042 as cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/739557 (https://phabricator.wikimedia.org/T290005) [15:41:49] checked the command you used for that too [15:41:52] (03CR) 10Ema: [C: 03+2] varnish: remove internal mtail scripts from default instance [puppet] - 10https://gerrit.wikimedia.org/r/739229 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [15:42:11] !log restarting pybal on lvs1016 [15:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:42:54] running "watch curl..." to check the value that is now 0.0 [15:43:06] we caused the alerts, did we? 
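A minimal Python sketch of the gate _joe_ describes above: after restarting pybal on one LVS host, wait until every `pybal_bgp_session_established` sample on that host reads 1.0 again before restarting pybal on the next host. The metric name and the `localhost:9090/metrics` endpoint come from the log; the function names, the label names in the sample text, and the assumption that the endpoint serves the standard Prometheus text format are illustrative and not taken from pybal itself.

```python
METRIC = "pybal_bgp_session_established"


def established_samples(metrics_text):
    """Return {label_string: value} for every sample of METRIC in the text.

    Assumes Prometheus text exposition format: `name{labels} value [timestamp]`.
    Label values containing a literal "}" are not handled in this sketch.
    """
    samples = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line.startswith(METRIC):
            continue  # also skips "# HELP" / "# TYPE" comment lines
        rest = line[len(METRIC):]
        if rest.startswith("{"):
            labels, sep, rest = rest.partition("}")
            if not sep:
                continue  # malformed line, ignore it
            labels += "}"
        elif rest[:1].isspace():
            labels = ""
        else:
            continue  # a different metric that merely shares the prefix
        fields = rest.split()
        if fields:
            samples[labels] = float(fields[0])
    return samples


def safe_to_restart_next(metrics_text):
    """True only if at least one session is reported and all of them are 1.0."""
    samples = established_samples(metrics_text)
    return bool(samples) and all(value == 1.0 for value in samples.values())


if __name__ == "__main__":
    # Hypothetical sample; real output would come from
    # `curl -s localhost:9090/metrics` on the LVS host that was just restarted.
    sample = (
        '# TYPE pybal_bgp_session_established gauge\n'
        'pybal_bgp_session_established{peer="192.0.2.1"} 1.0\n'
        'pybal_bgp_session_established{peer="192.0.2.2"} 0.0\n'
    )
    print(safe_to_restart_next(sample))
```

Treating an empty result as "not safe" is a deliberate choice in this sketch: if the metrics endpoint is unreachable or pybal has not come back yet, the check should block the next restart rather than wave it through.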
[15:43:09] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) If you can take pictures of the front panels that could be useful to instruct remote hands when they get to drmrs too. [15:43:11] (03PS24) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [15:43:17] <_joe_> mutante: yes, don't worry [15:43:20] ok [15:43:38] <_joe_> !log restarting pybal on lvs2009 [15:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:41] it's 1.0 now [15:43:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32477/console" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:44:07] <_joe_> mutante: ok, now you can move to lvs1015 [15:44:16] doing [15:44:46] <_joe_> actually I'm not sure why that bgp status alert is still firing [15:45:01] <_joe_> XioNoX / topranks ? [15:45:10] because Icinga just checks every 5 min? 
[15:45:14] looking [15:45:40] yeah they're all established [15:45:43] <_joe_> possibly [15:45:45] <_joe_> yeah ok [15:45:55] <_joe_> mutante: ok, you can restart 1015 then [15:45:58] tells Icinga to hurry up [15:46:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:46:42] !log restarting pybal on lvs1015 [15:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:58] <_joe_> mutante: excellent [15:46:59] (03PS1) 10Ema: varnish: fix path to default mtail scripts to remove [puppet] - 10https://gerrit.wikimedia.org/r/739558 (https://phabricator.wikimedia.org/T293879) [15:47:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:47:19] (03CR) 10Esanders: [C: 03+1] Make reply tool available as opt-out on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739556 (https://phabricator.wikimedia.org/T295838) (owner: 10Bartosz Dziewoński) [15:47:19] <_joe_> now if you want to check your service is up from the pov of lvs [15:47:22] lvs1015 - back to 1.0 [15:47:39] <_joe_> sudo ipvsadm -Lt 10.2.2.58:4111 [15:47:54] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739558 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [15:48:17] I was about to say "but I don't have the discovery name yet". 
ok, ack [15:48:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:49:12] _joe_: it gives me a bunch of routes to kubernetes hosts, yes [15:49:18] <_joe_> mutante: let's now check that monitoring requests work [15:50:00] curl kubernetes1014.eqiad.wmnet:4111/healthz [15:50:05] 404 cough [15:50:59] <_joe_> uh wait [15:51:01] <_joe_> no TLS? [15:51:16] (03CR) 10JMeybohm: [C: 03+2] admin_ng/common: Add a warning to allowCriticalPods switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/739551 (owner: 10JMeybohm) [15:51:32] <_joe_> mutante: why no tls? [15:51:42] <_joe_> we should have tls on by default by now [15:52:06] _joe_: I see that too but did not expect it. SSL routines:ssl3_get_record:wrong version number [15:52:25] eh yea, I need to fix this before we can continue then obviously [15:52:30] <_joe_> yep :) [15:52:38] <_joe_> sorry gotta go afk for 5 minute [15:52:47] is there anything bad about it being in this state ? [15:53:12] (03CR) 10Ema: [C: 03+2] varnish: fix path to default mtail scripts to remove [puppet] - 10https://gerrit.wikimedia.org/r/739558 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [15:53:29] yea, i'll work on that and we get back to it another time. 
still glad about the parts already done [15:53:32] thank you [15:53:59] (03PS13) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [15:54:38] (03CR) 10Ahmon Dancy: gitlab-runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [15:55:32] !log move codfw-ulsfo link to break-out cable [15:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [15:56:01] (03CR) 10Jbond: [V: 03+2] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [15:56:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [15:58:18] !log disable Telia BGP on cr1-codfw [15:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:26] (03CR) 10Ahmon Dancy: [C: 03+1] "Sounds good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/739366 (owner: 10Dzahn) [15:58:30] (03PS1) 10Ema: varnish: remove code used to clean up old mtail scripts [puppet] - 10https://gerrit.wikimedia.org/r/739560 (https://phabricator.wikimedia.org/T293879) [15:58:36] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367 (owner: 10Dzahn) [15:59:04] !log netbox: added ganeti01 and ganeti02 cluster definitions for drmrs [15:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:37] (03CR) 10BBlack: [C: 03+2] drmrs: define dual ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/739553 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [15:59:46] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: exim4.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:54] PROBLEM - Exim SMTP on mx2001 is CRITICAL: connect to address 208.80.153.45 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [16:01:04] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
[16:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:10] (03Merged) 10jenkins-bot: admin_ng/common: Add a warning to allowCriticalPods switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/739551 (owner: 10JMeybohm) [16:02:42] (03PS1) 10Jbond: P:mail::mx: move otrs_aliases_file to the top of the file [puppet] - 10https://gerrit.wikimedia.org/r/739561 [16:03:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32478/console" [puppet] - 10https://gerrit.wikimedia.org/r/739561 (owner: 10Jbond) [16:03:57] (03PS1) 10David Caro: wmcs: Introduce function run_one to run a command [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739562 [16:04:00] (03PS1) 10David Caro: wmcs: use raw help formatter and module docs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739563 [16:04:19] (03CR) 10Dzahn: [C: 03+1] P:mail::mx: move otrs_aliases_file to the top of the file [puppet] - 10https://gerrit.wikimedia.org/r/739561 (owner: 10Jbond) [16:04:25] !log re-enable Telia BGP on cr1-codfw [16:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:54] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:06:02] !log move cr1-codfw:xe-5/3/0 to BO cable [16:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:54] (03PS2) 10Jbond: P:mail::mx: move otrs_aliases_file to the top of the file [puppet] - 10https://gerrit.wikimedia.org/r/739561 [16:10:47] (03PS1) 10Bearloga: Continue decommissioning legacy Discovery dashboards [puppet] - 10https://gerrit.wikimedia.org/r/739564 (https://phabricator.wikimedia.org/T227782) [16:12:14] (03CR) 10jerkins-bot: [V: 04-1] 
Continue decommissioning legacy Discovery dashboards [puppet] - 10https://gerrit.wikimedia.org/r/739564 (https://phabricator.wikimedia.org/T227782) (owner: 10Bearloga) [16:12:54] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:13:40] (03CR) 10Jbond: [C: 03+2] P:mail::mx: move otrs_aliases_file to the top of the file [puppet] - 10https://gerrit.wikimedia.org/r/739561 (owner: 10Jbond) [16:14:52] (03PS1) 10Jbond: Revert "mx2001: disable ldap validation" [puppet] - 10https://gerrit.wikimedia.org/r/739496 [16:14:54] (03PS1) 10Jbond: Revert "role::exim: update config to drop ldap validation" [puppet] - 10https://gerrit.wikimedia.org/r/739497 [16:15:29] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "mx2001: disable ldap validation" [puppet] - 10https://gerrit.wikimedia.org/r/739496 (owner: 10Jbond) [16:15:43] (03PS2) 10Bearloga: Continue decommissioning legacy Discovery dashboards [puppet] - 10https://gerrit.wikimedia.org/r/739564 (https://phabricator.wikimedia.org/T227782) [16:15:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "role::exim: update config to drop ldap validation" [puppet] - 10https://gerrit.wikimedia.org/r/739497 (owner: 10Jbond) [16:16:34] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:40] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Tue 04 Jan 2022 11:55:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [16:17:36] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 31 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:19:06] !log cmooney@cumin2002 START - Cookbook sre.hosts.decommission for hosts rpki2001.codfw.wmnet [16:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:05] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10thcipriani) Approved! (Sorry for the delay) [16:21:05] !log move cr1-codfw<->cr2-eqdfw link to BO cable [16:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:03] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:22:09] expected ^ [16:24:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:25:07] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:25:56] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) There are actually 2 levels of access, parsoid-test-admins and parsoid-test-roots. test-admins has these sudo privs: ` 654... 
[16:27:18] (03CR) 10BryanDavis: [C: 03+1] python39: Use shell reimplementation of webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [16:27:18] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts rpki2001.codfw.wmnet [16:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:03] !log drain Telia eqiad-codfw link [16:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:05] !log cmooney@cumin2002 START - Cookbook sre.hosts.decommission for hosts rpki2001.codfw.wmnet [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:38] (03PS3) 10Dzahn: admin: add Julia Kieserman to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/739371 (https://phabricator.wikimedia.org/T295693) [16:31:46] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10MatthewVernon) `xfs_repair` found a number of problems with the filesystem, and more medium errors were reported by the kernel: ` Nov 17 14:24:4... 
[16:32:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:55] !log LDAP - added jkieserman to wmf (T295693) [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:59] T295693: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 [16:33:13] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:55] PROBLEM - clamd running on otrs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [16:34:36] !log cmooney@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts rpki2001.codfw.wmnet [16:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:44] 10SRE, 10Infrastructure-Foundations, 10netops: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin2002 for hosts: `rpki2001.codfw.wmnet` - rpki2001.codfw.wmnet (**FAIL**) - **Host steps... [16:34:57] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Dzahn) Hi @JKieserman you have now been added to the "wmf" LDAP group as requested. Various web logins should now work. You can find a list here: https://wikitech.wikim... 
[16:35:28] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Dzahn) 05Open→03Resolved a:03Dzahn [16:39:17] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:59] RECOVERY - clamd running on otrs1001 is OK: PROCS OK: 1 process with UID = 112 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/OTRS%23ClamAV [16:41:28] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) [16:42:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Idle - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:43:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Jelto) @DAbad could you please post your full ssh key and your wikitech username again please? I tried to find you wikitech account but it seems to be `dbad2021` instead of `... 
[16:44:12] (03CR) 10Brennen Bearnes: gitlab-runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [16:44:22] (03PS10) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [16:46:31] (03CR) 10Volans: "FYI, in case it might be useful, there is also a custom ArgparseFormatter to get both behaviours (raw formatting + defaults) importable wi" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/739563 (owner: 10David Caro) [16:46:51] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:49:15] (03PS1) 10Cwhite: profile: remove namespace_name constraint from shellbox filter [puppet] - 10https://gerrit.wikimedia.org/r/739572 [16:49:51] (03CR) 10Jelto: [C: 03+1] "adding @Brennen here as cc in case anything breaks and needs a rollback." 
[puppet] - 10https://gerrit.wikimedia.org/r/737968 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [16:51:41] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:52:49] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:53:20] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) [16:53:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:57:35] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:58:37] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 295 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:01:03] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:01:07] ^ jobrunners, not appservers, zooming out those spikes happen and it seems over. ACK [17:04:24] <_joe_> mutante: any idea what caused it? [17:05:06] <_joe_> TypeError: Argument 2 passed to Parser::preSaveTransform() must implement interface MediaWiki\Page\PageReference, null given, called in /srv/mediawiki/php-1.38.0-wmf.7/includes/preferences/DefaultPreferencesFactory.php on line seems to be the problem [17:05:21] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 75, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:07] <_joe_> *all* for metawiki [17:07:37] _joe_, I recently created a task about that [17:08:24] https://phabricator.wikimedia.org/T295543 [17:09:18] in theory it is fixed on head, just pending deployment [17:10:35] !log cmooney@cumin2002 START - Cookbook sre.ganeti.makevm for new host rpki2002.codfw.wmnet [17:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:30] !log repool Telia eqiad-codfw transport [17:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10ihurbain) For the record: I don't seem to have access to bastions either - it asks for a password instead of doing a key auth, on what... 
[17:15:46] (03CR) 10Legoktm: [C: 03+1] "Make sense" [puppet] - 10https://gerrit.wikimedia.org/r/739572 (owner: 10Cwhite) [17:16:01] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:25:52] !log cmooney@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki2002.codfw.wmnet [17:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:48] (03PS1) 10Herron: exim4.conf.mx: switch 'data' to 'condition' in otrs config [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) [17:27:29] (03CR) 10jerkins-bot: [V: 04-1] exim4.conf.mx: switch 'data' to 'condition' in otrs config [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [17:28:07] (03PS2) 10Herron: exim4.conf.mx: switch 'data' to 'condition' in otrs config [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) [17:29:08] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [17:30:33] (03PS1) 10Cathal Mooney: Add DHCP entry for install of rpki2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/739580 (https://phabricator.wikimedia.org/T292503) [17:32:05] (03CR) 10Herron: "I've confirmed manually that exim will run with after this change on deployment-mx03.deployment-prep.eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [17:33:52] (03CR) 10Jbond: "great thanks LGTM, lets aim to deploy this simlar time tomorrow?" 
[puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [17:34:23] (03CR) 10Herron: exim4.conf.mx: switch 'data' to 'condition' in otrs config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739579 (https://phabricator.wikimedia.org/T244792) (owner: 10Herron) [17:36:53] (03CR) 10VolkerE: [C: 04-1] Add new icons, wordmarks, taglines for several wikis: (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [17:38:03] (03PS1) 10Arturo Borrero Gonzalez: cloudbackup1001-dev: update DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/739581 (https://phabricator.wikimedia.org/T295584) [17:39:03] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudbackup1001-dev: update DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/739581 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [17:41:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. 
[17:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:41] (03CR) 10Herron: [C: 03+1] profile: remove namespace_name constraint from shellbox filter [puppet] - 10https://gerrit.wikimedia.org/r/739572 (owner: 10Cwhite) [17:48:35] (03CR) 10Herron: [C: 03+1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [17:50:18] (03PS1) 10BBlack: drmrs ganeti: add cluster cert public keys [puppet] - 10https://gerrit.wikimedia.org/r/739584 [17:50:55] (03CR) 10jerkins-bot: [V: 04-1] drmrs ganeti: add cluster cert public keys [puppet] - 10https://gerrit.wikimedia.org/r/739584 (owner: 10BBlack) [17:52:26] (03PS2) 10BBlack: drmrs ganeti: add cluster cert public keys [puppet] - 10https://gerrit.wikimedia.org/r/739584 (https://phabricator.wikimedia.org/T282787) [17:53:18] (03CR) 10BBlack: [C: 03+2] drmrs ganeti: add cluster cert public keys [puppet] - 10https://gerrit.wikimedia.org/r/739584 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [17:55:28] (03PS1) 10BBlack: Add dummy private keys for drmrs ganeti [labs/private] - 10https://gerrit.wikimedia.org/r/739586 (https://phabricator.wikimedia.org/T282787) [17:58:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739580 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [17:59:30] !log depool cp2042 to be reimaged as an HAProxy cache upload node - T290005 [17:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:34] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:00:05] (03PS2) 10Cathal Mooney: Add DHCP entry for install of rpki2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/739580 (https://phabricator.wikimedia.org/T292503) [18:00:15] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp2042 as 
cache::upload_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/739557 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:01:51] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [18:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:48] (03CR) 10Cathal Mooney: [C: 03+2] Add DHCP entry for install of rpki2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/739580 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [18:05:18] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:57] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS buster [18:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:09] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2042.codfw.wmnet with OS buster [18:06:29] (03PS11) 10Elukey: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [18:06:40] (03CR) 10Elukey: istio: Fix main config, add basic NetworkPolicy for staging/ml-serve (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/720906 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [18:08:56] (03PS1) 10BBlack: ganeti6: switch to ganeti role [puppet] - 10https://gerrit.wikimedia.org/r/739588 (https://phabricator.wikimedia.org/T282787) [18:11:15] (03CR) 10BBlack: [C: 03+2] ganeti6: switch to ganeti role [puppet] - 10https://gerrit.wikimedia.org/r/739588 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) 
[18:12:53] (03PS1) 10JMeybohm: Skipp (re-)building helm dependencies during CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/739590 [18:15:10] (03PS1) 10Arturo Borrero Gonzalez: site.pp: enable proper role for cloudbackup1001-dev [puppet] - 10https://gerrit.wikimedia.org/r/739591 (https://phabricator.wikimedia.org/T295584) [18:16:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] site.pp: enable proper role for cloudbackup1001-dev [puppet] - 10https://gerrit.wikimedia.org/r/739591 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [18:19:30] (03PS1) 10Volans: sre.hosts.reimage: ask confirmation at the start [cookbooks] - 10https://gerrit.wikimedia.org/r/739593 [18:20:36] (03PS1) 10BBlack: drmrs: include netbox svc file [dns] - 10https://gerrit.wikimedia.org/r/739594 (https://phabricator.wikimedia.org/T282787) [18:20:38] (03CR) 10Vgutierrez: [C: 03+1] sre.hosts.reimage: ask confirmation at the start [cookbooks] - 10https://gerrit.wikimedia.org/r/739593 (owner: 10Volans) [18:21:12] (03PS2) 10JMeybohm: Skip (re-)building helm dependencies during CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/739590 [18:21:56] (03PS3) 10JMeybohm: Skip (re-)building helm dependencies during CI [deployment-charts] - 10https://gerrit.wikimedia.org/r/739590 [18:22:12] (03CR) 10BBlack: [C: 03+2] drmrs: include netbox svc file [dns] - 10https://gerrit.wikimedia.org/r/739594 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [18:23:08] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Change went well in ulsfo earlier. De-pooled the site in DNS first and then proceeded with steps as outlined above. All went as expected. Did tak... 
[18:24:38] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: codfw1dev backups: add ldap_user_pass secret [labs/private] - 10https://gerrit.wikimedia.org/r/739595 (https://phabricator.wikimedia.org/T295584) [18:25:42] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: openstack: codfw1dev backups: add ldap_user_pass secret [labs/private] - 10https://gerrit.wikimedia.org/r/739595 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [18:27:06] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:28:07] (03CR) 10Michael DiPietro: [C: 03+1] kubeadm: raise default to 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/739402 (owner: 10Majavah) [18:29:40] (03PS1) 10Arturo Borrero Gonzalez: cloud: codfw1dev: hiera update for new backup servers [puppet] - 10https://gerrit.wikimedia.org/r/739599 (https://phabricator.wikimedia.org/T295584) [18:30:03] (03CR) 10Michael DiPietro: [C: 03+1] aptrepo: drop k8s 1.19 repos [puppet] - 10https://gerrit.wikimedia.org/r/739403 (owner: 10Majavah) [18:30:53] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: ask confirmation at the start [cookbooks] - 10https://gerrit.wikimedia.org/r/739593 (owner: 10Volans) [18:33:58] PROBLEM - configured eth on ganeti6003 is CRITICAL: public reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:34:23] (03Merged) 10jenkins-bot: sre.hosts.reimage: ask confirmation at the start [cookbooks] - 10https://gerrit.wikimedia.org/r/739593 (owner: 10Volans) [18:34:30] PROBLEM - configured eth on ganeti6002 is CRITICAL: public reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:34:30] (03PS1) 10Majavah: toolforge: enable wheel on buster [puppet] - 10https://gerrit.wikimedia.org/r/739600 [18:34:42] (03PS1) 10Cathal Mooney: Modifying globbing for partman recipie for rpki VMs [puppet] - 10https://gerrit.wikimedia.org/r/739601 (https://phabricator.wikimedia.org/T295672) [18:35:04] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:38:54] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739601 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [18:39:22] (03PS4) 10Majavah: wikimedia.org: add ldap-rw to replace ldap-labs [dns] - 10https://gerrit.wikimedia.org/r/739284 (https://phabricator.wikimedia.org/T295150) [18:39:32] (03CR) 10Cathal Mooney: [C: 03+2] Modifying globbing for partman recipie for rpki VMs [puppet] - 10https://gerrit.wikimedia.org/r/739601 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [18:41:08] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:42:15] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) Hi @ihurbain, welcome to WMF. You can't login on those bastions because you don't actually have a shell account yet (in produc... [18:42:44] PROBLEM - configured eth on ganeti6004 is CRITICAL: public reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:45:08] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) @ihurbain Also read L3 and sign it, please. [18:47:30] PROBLEM - configured eth on ganeti6001 is CRITICAL: public reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:48:08] (03PS1) 10Volans: sre.hosts.reimage: additional check of remote host [cookbooks] - 10https://gerrit.wikimedia.org/r/739603 [18:50:12] (03CR) 10Ahmon Dancy: [C: 04-1] "holding." [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [18:50:39] (03PS1) 10Vgutierrez: Remove digicert-2020 from upload/haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/739604 (https://phabricator.wikimedia.org/T289507) [18:51:29] (03CR) 10Vgutierrez: [C: 03+2] Remove digicert-2020 from upload/haproxy nodes [puppet] - 10https://gerrit.wikimedia.org/r/739604 (https://phabricator.wikimedia.org/T289507) (owner: 10Vgutierrez) [18:56:05] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:56:28] !log pool cp2042 (upload) running HAProxy as TLS terminator - T290005 [18:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:32] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:58:02] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp2042.codfw.wmnet with OS buster [18:58:04] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:14] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2042.codfw.wmnet with OS buster c... [18:58:33] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:00:04] jeena and dduvall: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T1900). [19:00:04] RoanKattouw and Urbanecm: May I have your attention please! UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T1900) [19:00:04] nn1l2 and MatmaRex: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] I can deploy today! 
[19:00:20] hello nn1l2 and MatmaRex [19:00:22] Hi [19:00:45] hi [19:01:25] (03CR) 10Urbanecm: [C: 03+2] Make reply tool available as opt-out on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739556 (https://phabricator.wikimedia.org/T295838) (owner: 10Bartosz Dziewoński) [19:02:40] (03Merged) 10jenkins-bot: Make reply tool available as opt-out on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739556 (https://phabricator.wikimedia.org/T295838) (owner: 10Bartosz Dziewoński) [19:02:49] (03PS3) 10Urbanecm: Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) (owner: 104nn1l2) [19:02:52] (03CR) 10Urbanecm: [C: 03+2] Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) (owner: 104nn1l2) [19:03:10] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) p:05Triage→03High [19:03:24] MatmaRex: hello, mwdebug1001 has your patch now. Can you test please? [19:03:41] (03Merged) 10jenkins-bot: Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) (owner: 104nn1l2) [19:04:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) [19:04:25] urbanecm: eah. looks good [19:04:29] yeah* [19:04:31] thansk, syncing [19:05:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) @Jclark-ctr thanks for the patience! 
We'd like to call the nodes `dse-k8s-worker100[1-4]`, let me know if this is viable (`dse` in this case me... [19:05:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:56] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7b3a1d976cb1ef931c809b3670fb8c8b3f3a56e7: Make reply tool available as opt-out on commonswiki (T295838) (duration: 01m 05s) [19:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:59] T295838: Config change: Deploy Reply Tool as opt-out preference at Commons - https://phabricator.wikimedia.org/T295838 [19:06:03] MatmaRex: live. Anything else from you? [19:06:04] thanks [19:06:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) [19:06:22] nn1l2: your patch is at mwdebug1001, can you test please? [19:07:46] OK [19:08:07] It's okay, confirmed [19:08:33] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] urbanecm: looks good to me [19:09:14] thanks, syncing [19:09:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:39] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8e167a53cec3c3b216100bab686f28e09c424435: Disable local file upload on the Chinese Wikisource (T295265) (duration: 01m 05s) [19:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:42] T295265: Disable local file upload for Chinese Wikisource - https://phabricator.wikimedia.org/T295265 [19:10:46] nn1l2: it's liven ow [19:10:49] *now [19:10:58] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:34] Thanks, urbanecm [19:11:37] np [19:14:21] (03CR) 10Ladsgroup: [C: 03+2] export: Ignore rev_page_id index [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739491 (https://phabricator.wikimedia.org/T285149) (owner: 10Ladsgroup) [19:15:35] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:23] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 974.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:16:32] (03PS1) 10Cathal Mooney: Add role in site.pp for new rpki2002 VM [puppet] - 10https://gerrit.wikimedia.org/r/739609 (https://phabricator.wikimedia.org/T292503) [19:17:23] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:17:35] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1036.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:18:28] (03CR) 10CDanis: [C: 03+1] Add 
role in site.pp for new rpki2002 VM [puppet] - 10https://gerrit.wikimedia.org/r/739609 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [19:19:01] (03CR) 10Cathal Mooney: [C: 03+2] Add role in site.pp for new rpki2002 VM [puppet] - 10https://gerrit.wikimedia.org/r/739609 (https://phabricator.wikimedia.org/T292503) (owner: 10Cathal Mooney) [19:19:19] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:20:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10ayounsi) For the record, there is also a link to lvs2007, after chatting with @bblack on irc, the usual `disable puppet then stop pybal` is to do bef... [19:24:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:22] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb={DELETE,LIST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:26:26] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:28:38] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:30:04] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10MSantos) @Dzahn and @ssastry I can't access `scandium.eqiad.wmnet` and `testreduce1001.eqiad.wmnet`. My shell user is `mbsantos`. [19:30:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10ayounsi) [19:33:22] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Urbanecm) >>! In T295900#7511306, @MSantos wrote: > @Dzahn and @ssastry I can't access `scandium.eqiad.wmnet` and `testreduce1001.eqi... [19:33:54] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [19:34:17] (03PS1) 10Cathal Mooney: Changing glob pattern for partman receipe for rpki VMs [puppet] - 10https://gerrit.wikimedia.org/r/739611 (https://phabricator.wikimedia.org/T292503) [19:36:03] (03Merged) 10jenkins-bot: export: Ignore rev_page_id index [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739491 (https://phabricator.wikimedia.org/T285149) (owner: 10Ladsgroup) [19:36:14] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10MSantos) @Urbanecm I had the impression this task is for all [[ https://www.mediawiki.org/wiki/Content_Transform_Team | Content Transf... [19:36:56] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Urbanecm) >>! In T295900#7511332, @MSantos wrote: > @Urbanecm I had the impression this task is for all [[ https://www.mediawiki.org/w... [19:42:09] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (10cmooney) 05Open→03Resolved Ok both VMs have been rebuilt with 20GB disk and updated to version 0.10.2. rpki1001 remains with the same name, r... 
[19:42:12] 10ops-eqdfw, 10DC-Ops: eqdfw:pdus - https://phabricator.wikimedia.org/T295921 (10RobH) [19:42:31] 10ops-eqdfw, 10DC-Ops: eqdfw:pdus - https://phabricator.wikimedia.org/T295921 (10RobH) [19:42:49] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/export/WikiExporter.php: Backport: [[gerrit:739491|export: Ignore rev_page_id index (T285149)]] (duration: 01m 04s) [19:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:53] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [19:44:10] PROBLEM - ganeti-mond running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [19:44:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:02] PROBLEM - ganeti-confd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [19:45:08] PROBLEM - ganeti-noded running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [19:46:52] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:30] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:04] jeena and dduvall: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T2000). [20:02:12] I will start the deployment in a few minutes [20:05:00] ACKNOWLEDGEMENT - configured eth on ganeti6001 is CRITICAL: public reporting no carrier. Brandon Black interface public will continue to show no link until we have some public instances created in these clusters, i think. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:05:00] ACKNOWLEDGEMENT - configured eth on ganeti6002 is CRITICAL: public reporting no carrier. Brandon Black interface public will continue to show no link until we have some public instances created in these clusters, i think. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:05:00] ACKNOWLEDGEMENT - configured eth on ganeti6003 is CRITICAL: public reporting no carrier. Brandon Black interface public will continue to show no link until we have some public instances created in these clusters, i think. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:05:00] ACKNOWLEDGEMENT - configured eth on ganeti6004 is CRITICAL: public reporting no carrier. Brandon Black interface public will continue to show no link until we have some public instances created in these clusters, i think. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:17:29] (03PS1) 10PipelineBot: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/739618 [20:18:39] (03CR) 10Herron: [C: 03+2] rsyslog: switch codfw TLS remote syslog destination to centrallog2002 [puppet] - 10https://gerrit.wikimedia.org/r/734405 (https://phabricator.wikimedia.org/T292196) (owner: 10Herron) [20:18:50] (03PS1) 10Majavah: hieradata: fix codfw1dev ntp server [puppet] - 10https://gerrit.wikimedia.org/r/739619 [20:22:40] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.9 refs T293950 [20:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:44] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [20:23:44] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.9 refs T293950 (duration: 01m 03s) [20:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:26] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10Papaul) @MatthewVernon thank you, this will do it [20:24:53] (03PS1) 10Ahmon Dancy: beta::autoupdater Don't mess with ${stage_dir}/php-master/cache/l10n [puppet] - 10https://gerrit.wikimedia.org/r/739620 (https://phabricator.wikimedia.org/T295304) [20:26:14] jeena: has that impacted https://phabricator.wikimedia.org/T295543 [20:26:41] Hopefully it's stopped [20:27:24] I don't see that in the logs atm but there are a bunch of other errors so I may roll back [20:31:33] jeena: that error should have stopped so not seeing it is good [20:31:40] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking): Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Jgiannelos) I also don't have access to `scandium` and 
`testreduce1001`. Similar with Mateus, I am new to the content transform team. [20:33:11] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.7" [20:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:57] (03CR) 10Ahmon Dancy: "Already cherry-picked in deployment-puppetmaster04" [puppet] - 10https://gerrit.wikimedia.org/r/739620 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [20:34:00] (03PS1) 10Jeena Huneidi: group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739621 [20:34:02] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739621 (owner: 10Jeena Huneidi) [20:34:04] (03PS1) 10Jeena Huneidi: Revert "group1 wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739622 [20:34:06] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739622 (owner: 10Jeena Huneidi) [20:34:19] hmm [20:34:21] that's weird [20:34:50] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.9 refs T293950 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739621 (owner: 10Jeena Huneidi) [20:34:53] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739622 (owner: 10Jeena Huneidi) [20:37:25] (03PS1) 10Dzahn: admin: add mbsantos and jgiannalos to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739623 (https://phabricator.wikimedia.org/T295900) [20:38:00] (03PS2) 10Dzahn: admin: add mbsantos and jgiannelos to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739623 (https://phabricator.wikimedia.org/T295900) [20:38:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:09] (03CR) 10Dzahn: [C: 03+2] "per IRC chat: we are starting with the "admins" group" [puppet] - 10https://gerrit.wikimedia.org/r/739623 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [20:41:28] (03CR) 10Dzahn: "this is just scandium and testreduce1001, not wtp/parse (prod parsoid)" [puppet] - 10https://gerrit.wikimedia.org/r/739623 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [20:42:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:13] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) @MSantos @Jgiannelos Try again now:) Since you had existing shell users it was in this case just adding... [20:45:03] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:54] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) This group gives you the following sudo privileges: ` 654 privileges: ['ALL = NOPASSWD: /usr/sbin/... 
[20:53:25] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) [20:53:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) [20:54:16] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) added the team members from https://www.mediawiki.org/wiki/Content_Transform_Team to have their own check... [20:55:34] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) All of the mgmt switches in row A have been replaced, netbox has been updated, the interface connection to msw1-a-eqiad has been updated in netbox. The old management... [20:56:52] duesen, Pchelolo: https://phabricator.wikimedia.org/T295930 looks like a cache issue [20:56:58] You flagged a cache related risky patch [20:57:50] jeena: if not done, it says raise alarm on slack's #platform-engineering-team [20:57:54] thanks RhinosF1 - I posted on slack as recommended but maybe it will get noticed on IRC [20:58:05] :) [20:58:20] RhinosF1: could you remind me which risky patch we flagged? :) [20:58:31] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699067 [20:58:37] Pchelolo: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/699067/ [20:58:59] Train got rolled back from .9 to .8 [20:59:05] right.. without a deployment last week I don't remember what we did anymore. 
[20:59:06] So is it backwards incompatible [20:59:08] will have a look [20:59:42] I'll make a task for the original error as well [20:59:52] Sounds good [21:00:04] jeena and dduvall: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T2000). Please do the needful. [21:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211117T2100). [21:00:32] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: fix codfw1dev ntp server [puppet] - 10https://gerrit.wikimedia.org/r/739619 (owner: 10Majavah) [21:00:51] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:12] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:04:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:04:11] jeena: this does not seem to have anything to do with the risky patch we've marked [21:04:11] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10ihurbain) @Dzahn I read and agreed and signed the L3. Here's my brand new public key: `ssh-ed25519 AAAAC3NzaC1l... 
[21:04:17] Pchelolo: I rolled back due to this: https://phabricator.wikimedia.org/T295931 [21:04:43] This has everythin to do with the risky patch :) [21:04:50] hehe [21:05:07] so maybe there was something else that was backwards incompatible [21:06:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:07:14] (03PS1) 10Dzahn: admin: let parsoid-test-admins run 'sudo mysql..' on test servers [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) [21:07:16] the array to string conversion errors seem to be stopping now though [21:07:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:50] (03PS2) 10Dzahn: admin: let parsoid-test-admins run 'sudo mysql..' on test servers [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) [21:09:22] (03CR) 10Dzahn: "affects only 2 test servers, not touching prod parsoid" [puppet] - 10https://gerrit.wikimedia.org/r/739647 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [21:09:36] jeena: maybe it was a very short cache [21:13:31] maybe [21:13:56] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Access to scandium.eqiad.wmnet & testreduce1001.eqiad.wmnet - https://phabricator.wikimedia.org/T295900 (10Dzahn) @ihurbain Hey Isabelle, thank you. confirmed signature:) And yes, that is a public key and looks good to... 
[21:18:43] jeena: I see some weirdness on https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=44476939, but I think that will get fixed after train is rolled back to group1 [21:19:09] (03PS2) 10Ebernhardson: Add CirrusSearch Old GC Hell alerting [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) [21:19:11] (03CR) 10Ebernhardson: Add CirrusSearch Old GC Hell alerting (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) (owner: 10Ebernhardson) [21:19:18] (03PS1) 10Dzahn: admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) [21:20:27] there's a CentralAuth logging patch in this train [21:20:48] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:20:49] thanks for the notice :) [21:22:25] .. this might even be the cause of https://phabricator.wikimedia.org/T295930 [21:22:52] (03PS2) 10Dzahn: admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) [21:23:07] oh hmm [21:23:55] (03CR) 10Dzahn: "Since she has existing LDAP access this means moving from the ldap_only section to the shell user section." [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [21:23:59] (03PS3) 10Ebernhardson: Add CirrusSearch Old GC Hell alerting [alerts] - 10https://gerrit.wikimedia.org/r/739034 (https://phabricator.wikimedia.org/T290604) [21:28:35] (03CR) 10Cwhite: "Deploying this right now will cause rsyslog to no longer relay messages to kafka, but I assume that's expected." 
[puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [21:28:51] left a comment to the task [21:29:05] (03CR) 10Subramanya Sastry: [C: 03+1] admin: upgrade ihurbain from ldap_only to shell, add to parsoid-test-admins [puppet] - 10https://gerrit.wikimedia.org/r/739648 (https://phabricator.wikimedia.org/T295900) (owner: 10Dzahn) [21:29:12] (03CR) 10Cwhite: [C: 03+2] profile: remove namespace_name constraint from shellbox filter [puppet] - 10https://gerrit.wikimedia.org/r/739572 (owner: 10Cwhite) [21:29:29] sorry about that! [21:30:27] Thanks for looking into it! I guess that means once the other blocker is resolved we can roll forward [21:31:23] yeah, I think it should be harmless minus a few log pages that look weird until we roll forward [21:32:21] I'll head to bed, unless you have something specific that I can help with right now [21:34:58] I don't think so. Have a good night [21:37:10] (03PS11) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [21:37:52] (03PS12) 10Brennen Bearnes: gitlab-runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [21:41:31] !log ayounsi@deploy1002 Started deploy [homer/deploy@dc007aa]: Homer CR738905 [21:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:57] !log ayounsi@deploy1002 Finished deploy [homer/deploy@dc007aa]: Homer CR738905 (duration: 01m 27s) [21:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:10] (03CR) 10Dzahn: [V: 03+1] acme_chief: convert cron to restart service to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:47:05] (03CR) 10Dzahn: [V: 03+1] acme_chief: convert cron to restart service to timer 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:50:41] (03CR) 10Ladsgroup: [C: 03+1] acme_chief: convert cron to restart service to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:54:47] (03CR) 10Dzahn: [V: 03+1] acme_chief: convert cron to restart service to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:58:41] (03CR) 10Dzahn: [V: 03+1] acme_chief: convert cron to restart service to timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [22:01:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (10Jclark-ctr) [22:01:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (10Jclark-ctr) Racked host civi1002 C1 U33 [22:02:51] 10SRE, 10Data-Persistence (Consultation), 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Krinkle) [22:04:44] 10SRE, 10Data-Persistence (Consultation), 10Performance-Team, 10Wikimedia-Rdbms, 10Sustainability (Incident Followup): Reimplement HHVM-like slow query log - https://phabricator.wikimedia.org/T293534 (10Krinkle) Tentatively merging the two as it seems work is progressing at T295706 to fulfill the same ne... 
[22:05:06] 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Krinkle) [22:05:25] 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Krinkle) [22:42:18] !log miscweb1002/2002 - moved /srv/deployment/scholarships to /root/ (T243037) [22:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:22] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [22:48:01] (03PS1) 10Dzahn: wikimania_scholarships: delete module and profile, remove from miscweb [puppet] - 10https://gerrit.wikimedia.org/r/739658 (https://phabricator.wikimedia.org/T243037) [22:49:25] (03PS1) 10Dzahn: cache::text: remove config for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) [22:49:28] 10SRE, 10SRE Observability, 10Wikimedia-Logstash, 10observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (10colewhite) [22:51:10] (03PS1) 10Dzahn: logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) [22:52:01] (03PS2) 10Dzahn: logstash: remove scholarships type from udp2log filters [puppet] - 10https://gerrit.wikimedia.org/r/739662 (https://phabricator.wikimedia.org/T243037) [22:53:02] (03PS1) 10Dzahn: deployment_server: remove scholarships [puppet] - 10https://gerrit.wikimedia.org/r/739663 [22:57:09] (03PS1) 10Dzahn: mariadb: remove all grants related to scholarship app and its dumps [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) [23:05:14] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for - https://phabricator.wikimedia.org/T295898 (10Aklapper) 
[23:09:12] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [23:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:01] SRE, LDAP-Access-Requests: Grant Access to Superset for - https://phabricator.wikimedia.org/T295898 (CGlenn) @Aklapper Hello! I closed this ticket because I need to help Olga & Brooke create wikitech account. Unless the MediaWiki log-in will work? [23:32:37] SRE, SRE Observability, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) [23:34:45] SRE, SRE Observability, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) [23:35:35] SRE, SRE Observability, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) [23:35:51] !log legoktm@cumin1001 conftool action : set/weight=10; selector: name=thumbor1005.eqiad.wmnet [23:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:12] (PS3) Ryan Kemper: elasticsearch: disallow puppet to restart [puppet] - https://gerrit.wikimedia.org/r/739379 (https://phabricator.wikimedia.org/T290902) [23:39:01] (PS1) Legoktm: thumbor: Add thumbor1006 [puppet] - https://gerrit.wikimedia.org/r/739673 (https://phabricator.wikimedia.org/T285477) [23:39:03] (PS1) Legoktm: conftool: Add thumbor1006 [puppet] - https://gerrit.wikimedia.org/r/739674 (https://phabricator.wikimedia.org/T285477) [23:39:30] SRE, SRE Observability, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) [23:40:07] (CR)
Legoktm: [C: +2] thumbor: Add thumbor1006 [puppet] - https://gerrit.wikimedia.org/r/739673 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm) [23:41:44] (PS1) Dzahn: miscweb: enable TLS, fix public port in defaults [deployment-charts] - https://gerrit.wikimedia.org/r/739675 (https://phabricator.wikimedia.org/T281538) [23:43:00] (CR) Legoktm: [C: +2] conftool: Add thumbor1006 [puppet] - https://gerrit.wikimedia.org/r/739674 (https://phabricator.wikimedia.org/T285477) (owner: Legoktm) [23:43:19] (PS2) Dzahn: miscweb: enable TLS [deployment-charts] - https://gerrit.wikimedia.org/r/739675 (https://phabricator.wikimedia.org/T281538) [23:43:41] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1006.eqiad.wmnet [23:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:00] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1006.eqiad.wmnet [23:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:15] !log legoktm@cumin1001 conftool action : set/weight=5; selector: name=thumbor1006.eqiad.wmnet [23:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:50] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet [23:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:33] SRE, SRE Observability, Wikimedia-Logstash, observability: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (colewhite) [23:49:23] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1001.eqiad.wmnet [23:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:19] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1003.eqiad.wmnet [23:51:20] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:59] confusingly, thumbor100[34] are older than thumbor100[12] [23:54:59] (PS1) Legoktm: Move thumbor2005 and thumbor2006 to thumbor::mediawiki [puppet] - https://gerrit.wikimedia.org/r/739677 (https://phabricator.wikimedia.org/T285477) [23:56:05] (PS3) Zabe: Test [mediawiki-config] - https://gerrit.wikimedia.org/r/735745 [23:56:30] (PS2) Zabe: Migrate wmfHostnames to wmgHostnames [mediawiki-config] - https://gerrit.wikimedia.org/r/734574 (https://phabricator.wikimedia.org/T45956) [23:59:02] (PS4) Zabe: Lossless optimization of the brwikimedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/735745