[00:23:40] 10SRE, 10Wikimedia-Mailing-lists: Make auditing members of mailing lists bound to a user right easier - https://phabricator.wikimedia.org/T286122 (10Platonides) mailman3 supports having an account with multiple emails. Requiring one of them (not necessarily the mail used in the mailing list) to match the wiki...
[00:25:23] 10SRE, 10Wikimedia-Mailing-lists, 10Znuny, 10Chinese-Sites: Mailman cannot correctly decode GB2312-superset mails labelled as GB2312 (non-standard behavior) - https://phabricator.wikimedia.org/T173894 (10Platonides)
[00:35:27] (03CR) 10Platonides: "The proper error code would be 551 («please use this other email instead»). Not that I am aware of any implementation using it, but it wou" [puppet] - 10https://gerrit.wikimedia.org/r/681242 (https://phabricator.wikimedia.org/T280472) (owner: 10Legoktm)
[00:38:32] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (10Platonides) This can't be //that// hard. @Legoktm do you want me to have a look at this? Doesn't seem to require any advenced permission, only on potd and ml, so...
[02:04:04] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (9461) = 76.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:14:48] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (9461) = 76.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:36:54] PROBLEM - SSH on cp5011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:20:40] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (9461) = 76.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:37:42] RECOVERY - SSH on cp5011.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:06:52] metawiki stopped recording successful edits for abusefilters at 2021-07-01 19:35 per https://meta.wikimedia.org/wiki/Special:AbuseLog?wpSearchUser=&wpSearchPeriodStart=&wpSearchPeriodEnd=&wpSearchTitle=&wpSearchImpact=1&wpSearchAction=any&wpSearchActionTaken=&wpSearchFilter=&wpSearchWiki=
[05:07:35] that seems more aligned with failed logging or recording; how can I check what was rolled out to metawiki at that time?
[05:10:04] https://phabricator.wikimedia.org/T286140
[05:10:11] 19:35 Synchronized php-1.37.0-wmf.12/tests/phpunit/includes/TitleMethodsTest.php: Backport: [[gerrit:702711|Consistently normalize Title::mFragment before setting (T285951)]] (duration: 01m 10s)
[05:10:12] T285951: Some section links in search results are redlinks - https://phabricator.wikimedia.org/T285951
[05:10:56] Synchronized php-1.37.0-wmf.12/includes/Title.php: Backport: [[gerrit:702711|Consistently normalize Title::mFragment before setting (T285951)]] (duration: 01m 10s)
[05:11:53] everything is wmf.12 atm https://versions.toolforge.org/
[05:36:52] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:58:53] The fix is to revert a recent patch, so it can be safely self +2'ed - I've +2'ed the revert for master, once it merges will create a cherry pick for wmf.12. Is anyone available for an emergency deployment?
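The revert-and-cherry-pick flow described above (self-+2 a revert on master, then cherry-pick it onto the deployed wmf branch) can also be driven through Gerrit's REST API. A minimal sketch, assuming the standard `/revert` and `/cherrypick` endpoints; the change number, credentials and commit message below are placeholders, not the actual change involved here:

```python
import json
import requests
from requests.auth import HTTPBasicAuth

# Placeholders: substitute your own Gerrit HTTP credentials.
GERRIT = "https://gerrit.wikimedia.org/r/a"  # authenticated REST prefix
AUTH = HTTPBasicAuth("example-user", "example-http-password")


def gerrit(method, path, body=None):
    """Call a Gerrit REST endpoint and strip the )]}' XSSI prefix."""
    resp = requests.request(method, GERRIT + path, json=body, auth=AUTH)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


# 1. Create a revert of the already-merged master change
#    (123456 is a placeholder change number, not the real one).
revert = gerrit("POST", "/changes/123456/revert",
                {"message": 'Revert "<offending change subject>"'})

# 2. Once the revert is merged, cherry-pick it onto the deployed branch.
backport = gerrit("POST",
                  f"/changes/{revert['_number']}/revisions/current/cherrypick",
                  {"destination": "wmf/1.37.0-wmf.12",
                   "message": revert["subject"]})

print("Backport change:", backport["_number"])
```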
[06:17:38] (03PS1) 10DannyS712: Revert "Replace depricating method IContextSource::getWikiPage to WikiPageFactory usage" [extensions/AbuseFilter] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702957 (https://phabricator.wikimedia.org/T286140)
[06:17:49] ^ thats the emergency deployment needed
[06:49:14] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Deploy window No deploys all week! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210704T0700)
[07:00:18] PROBLEM - SSH on cp5006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:49:04] 10SRE, 10ops-eqsin, 10Traffic, 10User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (10elukey) The problem seems to be fixed, and just to be sure: ` elukey@cr3-eqsin> show chassis environment pem PEM 0 status: State...
[07:49:53] (03PS1) 10Elukey: Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/702959
[07:54:58] 10SRE, 10ops-eqsin, 10Traffic, 10User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (10elukey) Only nit - cp5006's mgmt is still not reachable, we should follow up.
[07:58:05] (03CR) 10Vgutierrez: [C: 03+1] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/702959 (owner: 10Elukey)
[08:02:17] (03CR) 10Elukey: [C: 03+2] Revert "Depool eqsin" [dns] - 10https://gerrit.wikimedia.org/r/702959 (owner: 10Elukey)
[08:02:40] !log repool eqsin after equinix maintenance - T286113
[08:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:50] T286113: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113
[08:16:06] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is CRITICAL: 36.94 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:37:39] expected, eqsin repooled --^
[08:41:42] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:07:53] (03PS1) 10Majavah: kubeadm: Upgrade Calico to v3.18.4 [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342)
[09:11:34] (03PS2) 10Majavah: kubeadm: Upgrade Calico to v3.18.4 [puppet] - 10https://gerrit.wikimedia.org/r/703061 (https://phabricator.wikimedia.org/T280342)
[09:51:50] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:41:59] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 104 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[12:53:53] 10SRE, 10ops-eqsin, 10Traffic, 10User-MediaJS: IPMI Sensor Status Power_Supply Status: Critical on various eqsin servers - https://phabricator.wikimedia.org/T286113 (10elukey) 05Open→03Resolved a:03elukey
[14:06:56] PROBLEM - SSH on logstash2021.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:16:58] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:49:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:07:06] RECOVERY - SSH on cp5006.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:14:12] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:48] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:57:14] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:08:34] RECOVERY - SSH on logstash2021.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:58:02] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:04:17] DannyS712: just seeing above. i can deploy that revert.
[17:05:45] (assuming someone is around to test.)
[17:16:44] ^ going ahead with above. i think from T286140 it should be clear if fix worked.
[17:16:45] T286140: AbuseLog no longer recording revids of saved edits - https://phabricator.wikimedia.org/T286140
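One way to see whether the fix worked for the symptom in T286140 is to pull recent AbuseLog entries from the Action API and check whether they carry revision ids again. A rough sketch; the `aflprop` property names are an assumption based on the AbuseFilter API documentation, not taken from this log:

```python
import requests

API = "https://meta.wikimedia.org/w/api.php"

params = {
    "action": "query",
    "list": "abuselog",
    # Property names assumed from the AbuseFilter API docs; "revid" is the
    # field that stopped being populated per T286140.
    "aflprop": "ids|timestamp|result|revid",
    "afllimit": 25,
    "format": "json",
    "formatversion": 2,
}

entries = requests.get(API, params=params).json()["query"]["abuselog"]

for e in entries:
    # Entries for edits that were actually saved should carry a non-zero revid;
    # actions that were disallowed legitimately have none.
    print(e["timestamp"], e.get("result"), "revid =", e.get("revid"))
```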
[17:20:10] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Replace depricating method IContextSource::getWikiPage to WikiPageFactory usage" [extensions/AbuseFilter] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702957 (https://phabricator.wikimedia.org/T286140) (owner: 10DannyS712)
[17:38:33] (03Merged) 10jenkins-bot: Revert "Replace depricating method IContextSource::getWikiPage to WikiPageFactory usage" [extensions/AbuseFilter] (wmf/1.37.0-wmf.12) - 10https://gerrit.wikimedia.org/r/702957 (https://phabricator.wikimedia.org/T286140) (owner: 10DannyS712)
[17:43:53] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.12/extensions/AbuseFilter/includes/AbuseFilterHooks.php: Backport: [[gerrit:702957|Revert "Replace depricating method IContextSource::getWikiPage to WikiPageFactory usage" (T286140)]] (duration: 01m 06s)
[17:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:03] T286140: AbuseLog no longer recording revids of saved edits - https://phabricator.wikimedia.org/T286140
[17:45:45] sDrewth, DannyS712: above deployed, confirmed that diff links appear on new entries on https://en.wikipedia.org/wiki/Special:AbuseLog
[17:46:41] woop woop, thx brennen
[17:55:39] thanks brennen :)
[18:01:16] you bet.
[18:03:54] That's the second regression caused by that line of work
[21:17:05] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman doesn't replace email in notice when changing subscription email - https://phabricator.wikimedia.org/T286149 (10Legoktm) This was fixed upstream in https://gitlab.com/mailman/postorius/-/commit/b7fcca522ac0dd86831eb9788a8ec13abcdd2dd4
[21:21:06] 10SRE, 10Wikimedia-Mailing-lists: Mailman 3: Changing email address seems to break subscription for listadmins list - https://phabricator.wikimedia.org/T282328 (10Legoktm) Some related looking upstream issues are: https://gitlab.com/mailman/postorius/-/issues/472 and https://gitlab.com/mailman/postorius/-/issu...
[21:26:40] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (10Legoktm) a:05Legoktm→03Platonides Sorry, not sure why I dropped the ball on this. >>! In T265568#7196242, @Platonides wrote: > This can't be //that// hard....
[22:00:15] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (10Platonides) Well, having too many things is probably part of the reason ;-) I'll have a look. We will see if I end up regretting being so optimistic :P
[22:16:08] 10SRE, 10Commons, 10Tools, 10Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (10Platonides) And, weird enough, it both [[ https://lists.wikimedia.org/hyperkitty/list/daily-image-l@lists.wikimedia.org/thread/5TOP2JJ5WZJ2PC6PKFZTITF7BFZ2H62A/ |...
[23:53:35] 10SRE, 10Wikimedia-Mailing-lists: Mailman 3: Changing email address seems to break subscription for listadmins list - https://phabricator.wikimedia.org/T282328 (10TerraCodes) >>! In T282328#7196785, @Legoktm wrote: > Some related looking upstream issues are: https://gitlab.com/mailman/postorius/-/issues/472 an...
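Both the auditing idea in T286122 and the address-change breakage in T282328 revolve around Mailman 3's model of one account holding several email addresses. A small sketch using the mailmanclient library against Mailman core's REST API, which lists every address linked to each subscriber's account; the REST URL and credentials are placeholders, and it would need to run where the core API is reachable (normally the list server itself):

```python
from mailmanclient import Client

# Placeholder credentials; Mailman core's REST API is normally local-only.
client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
mlist = client.get_list("listadmins@lists.wikimedia.org")

for member in mlist.members:
    try:
        user = client.get_user(member.email)
    except Exception:
        # Address-only subscription with no linked Mailman account.
        print(f"{member.email}: no linked account")
        continue
    # All addresses attached to the same account, any one of which could be
    # the one that matches a wiki account when auditing.
    linked = sorted(address.email for address in user.addresses)
    print(f"{member.email}: account addresses = {linked}")
```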