[00:01:12] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Redirect old /mailman/options/ urls - https://phabricator.wikimedia.org/T286267 (Platonides) That was fast :)
[01:53:47] SRE, Wikimedia-Mailing-lists, Upstream: Internal server error (with ugly html tags) when changing Autoresponse postings text - https://phabricator.wikimedia.org/T286269 (Legoktm) Filed https://gitlab.com/mailman/mailman/-/issues/925 Submitted https://gitlab.com/mailman/mailman/-/merge_requests/888
[02:06:28] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:07:16] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:04:32] PROBLEM - Check whether ferm is active by checking the default input chain on labstore1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:35:26] RECOVERY - Check whether ferm is active by checking the default input chain on labstore1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:08:11] SRE, MediaWiki-Cache, Platform Engineering, Patch-For-Review: APCu caches are set to expire in 2073 instead of an hour if exptime is a unix timestamp - https://phabricator.wikimedia.org/T286260 (Joe) I strongly doubt a TTL of a WEEK would change anything compared to the current situation - appser...
[06:19:26] (PS1) Marostegui: db1110: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703538
[06:20:06] (CR) Marostegui: [C: +2] db1110: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/703538 (owner: Marostegui)
[06:29:06] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: fix networkpolicy ports definitions [deployment-charts] - https://gerrit.wikimedia.org/r/703449 (owner: Giuseppe Lavagetto)
[06:31:42] (Merged) jenkins-bot: mwdebug: fix networkpolicy ports definitions [deployment-charts] - https://gerrit.wikimedia.org/r/703449 (owner: Giuseppe Lavagetto)
[06:59:25] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210707T0700)
[07:03:58] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:51] PROBLEM - Disk space on backup2003 is CRITICAL: DISK CRITICAL - free space: /srv/bacula 2039187 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops
[07:18:26] ^ I will check that later
[07:19:15] (CR) Muehlenhoff: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/702886 (owner: Volans)
[07:26:26] SRE, ops-codfw: mgmt on logstash2021 inaccessible - https://phabricator.wikimedia.org/T286274 (MoritzMuehlenhoff)
[07:26:35] SRE, ops-codfw: mgmt on logstash2021 inaccessible - https://phabricator.wikimedia.org/T286274 (MoritzMuehlenhoff) p:Triage→Medium
[07:26:57] ACKNOWLEDGEMENT - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds Muehlenhoff T286274 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:32:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[07:37:59] <_joe_> uhm I have no idea what to do with that alert ^^
[07:38:10] <_joe_> moritzm, jbond any idea?
[07:42:57] TTBOMK I don't think it's really actionable, other than using it as a hint to fine-tune our network link capacities
[07:43:31] there's an ongoing maintenance by Telia
[07:44:16] so probably some other link we have is currently overusing the bandwidth it would normally use
[07:45:10] but we should probably improve the link so that it at least tells which link above quota or point to some dashboard
[07:50:07] the specific link is included in the email to noc@
[07:51:13] <_joe_> kill -HUP legoktm
[07:51:35] lolol
[07:52:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[07:56:15] !log bounced elasticsearch_5@production-logstash-eqiad on logstash1009
[07:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:37] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:13] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:16:01] (CR) Muehlenhoff: "Looks good to me, one nit inline." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[08:20:33] PROBLEM - Disk space on backup2003 is CRITICAL: DISK CRITICAL - free space: /srv/bacula 2011440 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2003&var-datasource=codfw+prometheus/ops
[08:21:31] SRE, LDAP-Access-Requests, SRE-Access-Requests: Requesting access to analytics cluster for Ben Tullis - https://phabricator.wikimedia.org/T285754 (MoritzMuehlenhoff) Open→Resolved Given that Ben is in root users and has cn=ops/cn=wmf LDAP membership this seems complete, closing the task so th...
[08:22:34] SRE, User-MoritzMuehlenhoff: Sort out which RAID packages are still needed - https://phabricator.wikimedia.org/T216043 (MoritzMuehlenhoff) p:Triage→Medium
[08:44:48] (PS1) WMDE-Fisch: Enable template search improvements on first wikis 1/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553)
[08:44:50] (PS1) WMDE-Fisch: Enable template search improvements on first wikis 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553)
[08:51:28] SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Ladsgroup) This sounds like that was another case in mw1331: https://logstash.wikimedia.org/goto/07c52e2c90e05449eb7acc4df15c0cb5
[08:55:15] *arg* no deployments this week 😱 ... why did I miss that?
[09:01:30] (CR) Svantje Lilienthal: "I think french wikipedia is missing. It is in the ticket but I do not see it in the list." [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[09:01:53] <_joe_> WMDE-Fisch: yeah sorry, we're on very reduced personnel this week
[09:05:24] (CR) Ladsgroup: [C: +1] "It has my virtual blessing" [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[09:10:34] (PS2) WMDE-Fisch: Enable template search improvements on first wikis 1/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553)
[09:11:36] (PS6) Jbond: C:rsync::server: convert to concat [puppet] - https://gerrit.wikimedia.org/r/703452 (https://phabricator.wikimedia.org/T205618)
[09:12:40] (PS1) WMDE-Fisch: Enable transclusion back button on first wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553)
[09:13:38] (CR) WMDE-Fisch: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[09:18:32] (PS4) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[09:21:57] (CR) jerkins-bot: [V: -1] sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[09:23:51] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:23:57] SRE, MW-on-K8s, serviceops, Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (Joe)
[09:25:33] (PS5) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[09:26:11] (CR) Jbond: "> Patch Set 3:" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[09:26:19] (PS6) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[09:29:13] (CR) jerkins-bot: [V: -1] sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[09:29:33] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:29:38] (PS1) Jbond: cloud - hiera: add defaults to pki project [puppet] - https://gerrit.wikimedia.org/r/703570
[09:31:27] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:34:53] (CR) Jbond: [C: +2] cloud - hiera: add defaults to pki project [puppet] - https://gerrit.wikimedia.org/r/703570 (owner: Jbond)
[09:35:01] (PS2) Jbond: cloud - hiera: add defaults to pki project [puppet] - https://gerrit.wikimedia.org/r/703570
[09:35:10] (CR) Jbond: [V: +2 C: +2] cloud - hiera: add defaults to pki project [puppet] - https://gerrit.wikimedia.org/r/703570 (owner: Jbond)
[09:37:08] (PS7) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[09:46:43] (PS1) Muehlenhoff: Deploy systemd-login logout.d script fleet-wide [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242)
[09:48:34] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[10:09:51] (CR) Muehlenhoff: [C: +1] "Looks good, one remaining nit inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[10:12:35] (CR) Svantje Lilienthal: [C: +1] Enable transclusion back button on first wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:13:35] (CR) Jbond: [C: +1] "lgtm" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/701876 (owner: David Caro)
[10:13:39] (CR) Svantje Lilienthal: [C: +1] "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:14:54] (CR) Svantje Lilienthal: "Why did you put this in a separate patch?" [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:14:56] (PS8) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[10:14:58] (CR) Jbond: sre.idm.logout: create cookbook to logout users (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[10:15:08] (PS9) Jbond: sre.idm.logout: create cookbook to logout users [cookbooks] - https://gerrit.wikimedia.org/r/701498
[10:18:57] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable template search improvements on first wikis 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:19:29] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable template search improvements on first wikis 1/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703566 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:20:28] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable transclusion back button on first wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/703568 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:22:18] (CR) Muehlenhoff: [C: +1] "Looks good!" [cookbooks] - https://gerrit.wikimedia.org/r/701498 (owner: Jbond)
[10:24:52] (CR) Jbond: "See inline nit" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[10:25:05] (CR) WMDE-Fisch: "> Patch Set 1:" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[10:28:31] (PS1) Giuseppe Lavagetto: mwdebug: constrain mwdebug to run on a single node [deployment-charts] - https://gerrit.wikimedia.org/r/703579
[10:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316', diff saved to https://phabricator.wikimedia.org/P16778 and previous config saved to /var/cache/conftool/dbconfig/20210707-103553-marostegui.json
[10:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 25%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16779 and previous config saved to /var/cache/conftool/dbconfig/20210707-103638-root.json
[10:36:41] SRE, vm-requests: eqiad/codfw: 1 of VMs requested for MX - https://phabricator.wikimedia.org/T286208 (jbond) LGTM
[10:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:49] (CR) Muehlenhoff: Deploy systemd-login logout.d script fleet-wide (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[10:36:51] (PS2) Muehlenhoff: Deploy systemd-login logout.d script fleet-wide [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242)
[10:37:17] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[10:51:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 50%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16780 and previous config saved to /var/cache/conftool/dbconfig/20210707-105142-root.json
[10:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:00] (CR) Jbond: Deploy systemd-login logout.d script fleet-wide (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[10:59:44] (PS1) Jbond: cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840)
[11:00:11] (CR) jerkins-bot: [V: -1] cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[11:03:02] (PS3) Muehlenhoff: Deploy systemd-login logout.d script fleet-wide [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242)
[11:05:04] (PS2) Jbond: cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840)
[11:05:26] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[11:05:31] (CR) jerkins-bot: [V: -1] cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[11:05:49] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30127/console" [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[11:06:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 75%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16781 and previous config saved to /var/cache/conftool/dbconfig/20210707-110645-root.json
[11:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:09] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/703571 (https://phabricator.wikimedia.org/T283242) (owner: Muehlenhoff)
[11:21:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2087:3316 (re)pooling @ 100%: Repool after index change', diff saved to https://phabricator.wikimedia.org/P16782 and previous config saved to /var/cache/conftool/dbconfig/20210707-112149-root.json
[11:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:53] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host mx2002.wikimedia.org
[11:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:13] (PS3) Jbond: cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840)
[11:35:56] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30128/console" [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[11:37:44] (CR) Jbond: [V: +1 C: +2] cfss;::cert: add ability to ignore legacy CN certificates [puppet] - https://gerrit.wikimedia.org/r/703580 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[11:43:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx2002.wikimedia.org
[11:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host mx1002.wikimedia.org
[11:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:16] (PS1) Jbond: P:pki:multirootca: Add ACL so promethous nodes can scrap data [puppet] - https://gerrit.wikimedia.org/r/703589 (https://phabricator.wikimedia.org/T283840)
[12:05:25] (PS2) Jbond: P:pki::multirootca: Add ACL so prometheus nodes can scrap data [puppet] - https://gerrit.wikimedia.org/r/703589 (https://phabricator.wikimedia.org/T283840)
[12:05:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host mx1002.wikimedia.org
[12:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:34] (CR) Zabe: [C: +1] zhwiktionary: Add templateeditor right [mediawiki-config] - https://gerrit.wikimedia.org/r/703482 (https://phabricator.wikimedia.org/T286101) (owner: Tks4Fish)
[12:09:07] (CR) Zabe: [C: +1] zhwiktionary: Add namespaces: *118 - Reconstruction *119 - Reconstruction Talk [mediawiki-config] - https://gerrit.wikimedia.org/r/703480 (https://phabricator.wikimedia.org/T286101) (owner: Tks4Fish)
[12:12:02] !log Start server-side upload for 3 video files (T286173, T286175, T286174)
[12:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:15] T286174: Server side upload for 고려 - https://phabricator.wikimedia.org/T286174
[12:12:16] T286173: Server side upload for 고려 - https://phabricator.wikimedia.org/T286173
[12:12:17] T286175: Server side upload for 고려 - https://phabricator.wikimedia.org/T286175
[12:14:40] (PS3) Jbond: P:pki::multirootca: Add ACL so prometheus nodes can scrap data [puppet] - https://gerrit.wikimedia.org/r/703589 (https://phabricator.wikimedia.org/T283840)
[12:15:26] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30131/console" [puppet] - https://gerrit.wikimedia.org/r/703589 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[12:26:05] (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: Add ACL so prometheus nodes can scrap data [puppet] - https://gerrit.wikimedia.org/r/703589 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[12:57:00] (PS1) Jbond: P:pki::multirootca: Don't require client auth for metricts endpoint [puppet] - https://gerrit.wikimedia.org/r/703596 (https://phabricator.wikimedia.org/T283840)
[13:03:24] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30134/console" [puppet] - https://gerrit.wikimedia.org/r/703596 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[13:04:17] (CR) Svantje Lilienthal: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[13:09:28] !log otto@deploy1002 Started deploy [analytics/refinery@8de71e6]: analytics cluster deploy for webrequest gobblin job migration - T271232
[13:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:39] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[13:11:08] (CR) Ottomata: [C: +2] Bump eventgate image version to get normalized prometheus metric labels [deployment-charts] - https://gerrit.wikimedia.org/r/703487 (https://phabricator.wikimedia.org/T272714) (owner: Ottomata)
[13:12:39] !log otto@deploy1002 Finished deploy [analytics/refinery@8de71e6]: analytics cluster deploy for webrequest gobblin job migration - T271232 (duration: 03m 11s)
[13:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:10] (CR) Svantje Lilienthal: [C: +1] Enable template search improvements on first wikis 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[13:13:19] (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: Don't require client auth for metricts endpoint [puppet] - https://gerrit.wikimedia.org/r/703596 (https://phabricator.wikimedia.org/T283840) (owner: Jbond)
[13:13:28] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[13:13:29] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[13:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:36] (PS2) Svantje Lilienthal: Enable template search improvements on first wikis 2/2 [mediawiki-config] - https://gerrit.wikimedia.org/r/703567 (https://phabricator.wikimedia.org/T284553) (owner: WMDE-Fisch)
[13:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:10] (CR) Giuseppe Lavagetto: Add Shellbox to {Production,Labs}Services.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[13:17:30] (PS1) Jbond: Revert "P:pki::multirootca: Don't require client auth for metricts endpoint" [puppet] - https://gerrit.wikimedia.org/r/703557
[13:18:56] (CR) Giuseppe Lavagetto: [C: +1] "I think we might need to raise the timeout in the future, but this is good enough at this point." [puppet] - https://gerrit.wikimedia.org/r/703491 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[13:19:42] !log otto@deploy1002 Started deploy [analytics/refinery@8de71e6] (hadoop-test): analytics test cluster deploy for webrequest_test gobblin dir fixes - T271232
[13:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:51] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[13:20:52] (PS2) Jbond: Revert "P:pki::multirootca: Don't require client auth for metricts endpoint" [puppet] - https://gerrit.wikimedia.org/r/703557
[13:22:39] (CR) Jbond: [C: +2] Revert "P:pki::multirootca: Don't require client auth for metricts endpoint" [puppet] - https://gerrit.wikimedia.org/r/703557 (owner: Jbond)
[13:24:35] (PS1) Urbanecm: Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/703600 (https://phabricator.wikimedia.org/T286163)
[13:25:23] !log otto@deploy1002 Finished deploy [analytics/refinery@8de71e6] (hadoop-test): analytics test cluster deploy for webrequest_test gobblin dir fixes - T271232 (duration: 05m 41s)
[13:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:32] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[13:32:18] (PS2) Ottomata: Ensure absent webrequest camus jobs [puppet] - https://gerrit.wikimedia.org/r/703474 (https://phabricator.wikimedia.org/T271232)
[13:32:23] (PS1) Urbanecm: enwiki: Delete Book namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/703601 (https://phabricator.wikimedia.org/T285766)
[13:33:23] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: constrain mwdebug to run on a single node [deployment-charts] - https://gerrit.wikimedia.org/r/703579 (owner: Giuseppe Lavagetto)
[13:34:04] (CR) Ottomata: [C: +2] Ensure absent webrequest camus jobs [puppet] - https://gerrit.wikimedia.org/r/703474 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:36:54] (Merged) jenkins-bot: mwdebug: constrain mwdebug to run on a single node [deployment-charts] - https://gerrit.wikimedia.org/r/703579 (owner: Giuseppe Lavagetto)
[13:37:15] PROBLEM - Host mw2267 is DOWN: PING CRITICAL - Packet loss = 100%
[13:42:26] (PS1) Ottomata: Update webrequest raw data purge job with new partition path format for gobblin [puppet] - https://gerrit.wikimedia.org/r/703603 (https://phabricator.wikimedia.org/T271232)
[13:51:36] (CR) Joal: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/703603 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[13:53:27] (CR) Ottomata: [C: +2] Update webrequest raw data purge job with new partition path format for gobblin [puppet] - https://gerrit.wikimedia.org/r/703603 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[14:00:10] <_joe_> I'm looking at mw2267
[14:05:41] <_joe_> !log powercycling mw2267, stuck witout network, blank console
[14:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:37] RECOVERY - Host mw2267 is UP: PING OK - Packet loss = 0%, RTA = 31.60 ms
[14:47:18] (PS1) Giuseppe Lavagetto: mwdebug: sync to the latest image version [deployment-charts] - https://gerrit.wikimedia.org/r/703609
[14:49:59] !log installing djvulibre security updates
[14:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:06] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: sync to the latest image version [deployment-charts] - https://gerrit.wikimedia.org/r/703609 (owner: Giuseppe Lavagetto)
[15:18:49] (Merged) jenkins-bot: mwdebug: sync to the latest image version [deployment-charts] - https://gerrit.wikimedia.org/r/703609 (owner: Giuseppe Lavagetto)
[15:19:49] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:13] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:00] (PS1) Giuseppe Lavagetto: mediawiki: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/703615
[15:37:09] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/703615 (owner: Giuseppe Lavagetto)
[15:40:01] (Merged) jenkins-bot: mediawiki: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/703615 (owner: Giuseppe Lavagetto)
[15:55:38] (PS1) Majavah: P::toolforge:apt_pinning: drop legacy paws k8s cluster pins [puppet] - https://gerrit.wikimedia.org/r/703618
[15:57:25] (PS1) Majavah: toolforge: drop dedicated grid queues [puppet] - https://gerrit.wikimedia.org/r/703619
[16:01:29] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:20] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:16] <_joe_> first time our baby goes out to a production k8s cluster
[16:05:04] !log joal@deploy1002 Started deploy [analytics/refinery@b5c4462] (hadoop-test): Analytics deploy for Gobblin replacing Camus - HADOOP-TEST [analytics/refinery@b5c4462]
[16:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:20] they grow up so fast 🥲
[16:08:19] <_joe_> uhm crashloopbackoff :/
[16:09:48] <_joe_> oh just some mcrouter misconfiguration, my bad
[16:15:26] !log joal@deploy1002 Finished deploy [analytics/refinery@b5c4462] (hadoop-test): Analytics deploy for Gobblin replacing Camus - HADOOP-TEST [analytics/refinery@b5c4462] (duration: 10m 21s)
[16:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:30] (PS1) Giuseppe Lavagetto: mwdebug: add production values for eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/703620
[16:18:39] (CR) jerkins-bot: [V: -1] mwdebug: add production values for eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/703620 (owner: Giuseppe Lavagetto)
[16:18:47] (PS2) Giuseppe Lavagetto: mwdebug: add production values for eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/703620
[16:24:24] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: add production values for eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/703620 (owner: Giuseppe Lavagetto)
[16:26:58] (Merged) jenkins-bot: mwdebug: add production values for eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/703620 (owner: Giuseppe Lavagetto)
[16:28:31] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[16:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:48] <_joe_> and now it works.
[16:52:20] !log joal@deploy1002 Started deploy [analytics/refinery@b5c4462]: Analytics deploy for Gobblin replacing Camus - an-launcher1002 only [analytics/refinery@b5c4462]
[16:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:30] !log joal@deploy1002 Finished deploy [analytics/refinery@b5c4462]: Analytics deploy for Gobblin replacing Camus - an-launcher1002 only [analytics/refinery@b5c4462] (duration: 03m 10s)
[16:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:54] exciting :DD
[17:31:42] (CR) Legoktm: [C: +2] services_proxy: Add envoyproxy for shellbox [puppet] - https://gerrit.wikimedia.org/r/703491 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[17:34:04] `curl http://localhost:6024/healthz` works perfectly
[17:36:38] !log otto@deploy1002 Started deploy [analytics/refinery@46c0b84]: Deploy for gobblin migration - Refine now supports gzip - T271232
[17:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:46] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[17:49:40] (PS1) Ottomata: Gobblinize refine_netflow job [puppet] - https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232)
[17:52:29] PROBLEM - k8s API server requests latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 verb={GET,LIST,PATCH,PUT} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:52:49] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:52:51] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:54:00] !log otto@deploy1002 Finished deploy [analytics/refinery@46c0b84]: Deploy for gobblin migration - Refine now supports gzip - T271232 (duration: 17m 22s)
[17:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:08] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232
[17:55:53] (PS2) Legoktm: Add Shellbox to {Production,Labs}Services.php [mediawiki-config] - https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423)
[17:55:55] (PS6) Legoktm: Re-enable Score using Shellbox on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/703489 (https://phabricator.wikimedia.org/T281423)
[17:56:00] (CR) Legoktm: Add Shellbox to {Production,Labs}Services.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[17:56:17] RECOVERY - k8s API server requests latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:56:37] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:56:37] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[17:56:44] (CR) Legoktm: [C: +2] "No-op" [mediawiki-config] - https://gerrit.wikimedia.org/r/703495 (owner: Legoktm)
[17:57:23] (Merged) jenkins-bot: Document $wgShellboxSecretKey in private/readme.php [mediawiki-config] - https://gerrit.wikimedia.org/r/703495 (owner: Legoktm)
[17:58:30] !log otto@deploy1002 Started deploy [analytics/refinery@46c0b84] (hadoop-test): Deploy for gobblin migration - Refine now supports gzip - T271232
[17:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:21] (CR) Ottomata: Gobblinize refine_netflow job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[17:59:32] !log legoktm@deploy1002 Synchronized private/readme.php: Document $wgShellboxSecretKey in private/readme.php (duration: 01m 01s)
[17:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:06] (CR) Joal: Gobblinize refine_netflow job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:01:53] (CR) Legoktm: [C: +2] "Also no-op, these variables aren't in use yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[18:02:37] (Merged) jenkins-bot: Add Shellbox to {Production,Labs}Services.php [mediawiki-config] - https://gerrit.wikimedia.org/r/703496 (https://phabricator.wikimedia.org/T281423) (owner: Legoktm)
[18:03:44] (CR) Ottomata: Gobblinize refine_netflow job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: Ottomata)
[18:03:59] !log otto@deploy1002 Finished deploy [analytics/refinery@46c0b84] (hadoop-test): Deploy for gobblin migration - Refine now
supports gzip - T271232 (duration: 05m 28s) [18:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:07] T271232: Replace Camus by Gobblin - https://phabricator.wikimedia.org/T271232 [18:04:25] legoktm: let me know if you want more things with this thing https://shellbox-beta.wmcloud.org/healthz [18:05:22] I'm going to poke at it in a bit :D [18:05:40] !log legoktm@deploy1002 Synchronized wmf-config/LabsServices.php: Add Shellbox to {Production,Labs}Services.php (1/2) (duration: 00m 59s) [18:05:41] thank you for taking care of what I thought would be a giant problem for me <3 [18:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:04] Amir1: legoktm: i'd rather have its internal traffic not go via the shared cloudvps proxy, let's set up a service dns name for it instead and possibly envoy for tls termination (we have zeroconf cfssl certs on deployment-prep that mediawiki should trust) [18:06:33] sure [18:06:53] !log legoktm@deploy1002 Synchronized wmf-config/ProductionServices.php: Add Shellbox to {Production,Labs}Services.php (2/2) (duration: 00m 59s) [18:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] majavah: it doesn't send anything sensitive over HTTP AFAIK [18:07:49] (03CR) 10Joal: Gobblinize refine_netflow job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/703623 (https://phabricator.wikimedia.org/T271232) (owner: 10Ottomata) [18:07:57] it does HMARC validation using the secret [18:08:31] but it should be okay to pass through "unsafe" routes, it's design of MAC [18:09:43] *HMAC [18:13:36] I think majavah's concern is just having extra unnecessary traffic go over the shared proxy, plus it being another failure point [18:13:57] legoktm: correct [18:14:55] what do we need to do for envoy? 
[18:15:22] i think that's as simple as applying some hiera, let me find an example for you [18:16:44] legoktm: should be as simple as https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-ms-fe on the shellbox prefix / host [18:16:51] cfssl takes care of certs automatically [18:17:22] doing it on the mediawiki side is much more complicated as we don't have service::catalog set up on deployment-prep, no lvs there [18:17:37] (03CR) 10Zabe: [C: 03+1] Add sayahna.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703600 (https://phabricator.wikimedia.org/T286163) (owner: 10Urbanecm) [18:20:04] aah I see [18:20:06] okay then [18:22:33] majavah: oh, I guess you're saying we need 2 envoys [18:23:18] if it's not straightforward, we can skip envoy on the MW side, just a slight perf hit if PHP has to do the TLS stuff on each request [18:24:53] legoktm: doing envoy on deployment-prep mediawiki side has been on my to-do list on a while, but afaik not straightforward as we don't have lvs service ips on deployment-prep but the service::catalog hiera key (which is used to create envoy configs) needs them [18:30:44] gotcha [20:10:15] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:22] (03PS1) 10Legoktm: Revert "admin_state: depool eqiad for datacenter switchover (June 2021)" [dns] - 10https://gerrit.wikimedia.org/r/703561 [20:18:02] (03CR) 10RLazarus: [C: 03+1] Revert "admin_state: depool eqiad for datacenter switchover (June 2021)" [dns] - 10https://gerrit.wikimedia.org/r/703561 (owner: 10Legoktm) [20:20:43] (03PS2) 10Legoktm: Revert "admin_state: depool eqiad for datacenter switchover (June 2021)" [dns] - 10https://gerrit.wikimedia.org/r/703561 [20:21:45] (03CR) 10Legoktm: [C: 03+2] Revert "admin_state: depool eqiad for datacenter switchover (June 
2021)" [dns] - 10https://gerrit.wikimedia.org/r/703561 (owner: 10Legoktm) [20:22:09] !log repooling eqiad - https://gerrit.wikimedia.org/r/703561 [20:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:43] * legoktm is watching https://grafana.wikimedia.org/d/000000093/varnish-traffic?orgId=1&from=now-1h&to=now [20:33:03] (03PS1) 10Legoktm: Re-depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/703562 [20:33:35] (03CR) 10RLazarus: [C: 03+1] Re-depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/703562 (owner: 10Legoktm) [20:34:15] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on alert1001 is CRITICAL: 43.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:34:42] ^ expected [21:02:45] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:11:03] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:29:58] (03Abandoned) 10Legoktm: Re-depool eqiad [dns] - 10https://gerrit.wikimedia.org/r/703562 (owner: 10Legoktm) [21:32:52] (03CR) 10Legoktm: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [21:44:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) pc1011 [21:55:13] (03PS2) 10Legoktm: Merge db-codfw.php and db-eqiad.php into db-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) 
[21:59:32] (03CR) 10Legoktm: "I added inline documentation based on Krinkle's suggestion on how to have a wiki vary it's section based on datacenter." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [22:04:19] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) pc1011 A1 u21 port 9 Cableid#3963 pc1012 B1 u28 port17 Cableid#3947 pc1013 C5 u26 port24 Cableid#3410 pc1014 D6 u36 port35 Cableid#23000064 [22:04:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) [22:51:12] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) pc1011 A1 u21 port 9 Cableid#3963 IP 10.65.1.187 pc1012 B1 u28 port17 Cableid#3947 IP 10.65.1.188 pc1013 C5 u26 port24 Cableid#3410 IP 10.65.1.189 pc1014 D6 u36 port35 Cabl... [22:51:55] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) [22:55:18] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pc1011-pc1014 - https://phabricator.wikimedia.org/T282484 (10Jclark-ctr) a:05Jclark-ctr→03RobH @RobH only script in netbox has been run. bios/drac/serial setup have been configured. 
And host have been powered off [23:03:25] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:17:03] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:35:11] (03CR) 10Krinkle: Merge db-codfw.php and db-eqiad.php into db-production.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm) [23:45:24] 10SRE, 10MediaWiki-Parser, 10serviceops, 10Performance-Team (Radar): purgeParserCache.php: Cannot purge this kind of parser cache - https://phabricator.wikimedia.org/T250231 (10Krinkle) [23:52:26] (03PS1) 10Krinkle: Move parsercache DB config to *Services.php (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703629 [23:52:28] (03PS1) 10Krinkle: Move parsercache DB config to *Services.php (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703630 [23:52:30] (03PS1) 10Krinkle: Move parsercache DB config to *Services.php (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/703631 [23:53:10] (03CR) 10Krinkle: Merge db-codfw.php and db-eqiad.php into db-production.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/702421 (https://phabricator.wikimedia.org/T260297) (owner: 10Legoktm)