[00:43:39] 10SRE, 10netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (10faidon) @wiki_willy I think you mentioned that you talked to the account reps about this, unless I misunderstood this. If so, what's the current status of those conversations? [01:04:31] 10SRE, 10netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (10wiki_willy) Hi @faidon - yup, here was my previous update - https://phabricator.wikimedia.org/T273308#6954080 I've brought up the concern (along with a few other concerns) to both our account rep and his m... [01:08:45] (03PS1) 10Papaul: ADD new mw servers in rack A5 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/677057 (https://phabricator.wikimedia.org/T274171) [01:25:22] Jdlrobson: are you around at the moment? [01:33:35] (03CR) 10Papaul: [C: 03+2] ADD new mw servers in rack A5 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/677057 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [01:36:11] replied on-task. will roll out the clean revert tomorrow if no reply. should be harmless and uncontroversial, but if there's time for something better, that's fine too.
[01:36:15] * Krinkle zzz [01:40:31] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decommission scb200[1-6].codfw.wmnet - https://phabricator.wikimedia.org/T275760 (10Papaul) 05Open→03Resolved complete [01:47:43] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [01:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:48:54] PROBLEM - MariaDB Replica Lag: s4 on db2106 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1425.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:49:14] PROBLEM - MariaDB Replica Lag: s4 on db2147 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1445.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:53:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:55:23] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:44] (03PS1) 10Krinkle: [legacy] Restore old floating style inside Vector [skins/Vector] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676955 [02:07:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.38 [core] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677063 [02:40:55] (03CR) 10Slaporte: [C: 03+1] "Approved from Legal, with a minor punctuation correction." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [03:24:58] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:30:02] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:17:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:20:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:21:39] Urbanecm, bd808: you could use mwph to strip out tags [04:55:54] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:50] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2106 and db2147 after a crash', diff saved to https://phabricator.wikimedia.org/P15151 and previous config saved to /var/cache/conftool/dbconfig/20210406-053427-marostegui.json [05:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:58] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 3007686 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [05:44:54] RECOVERY - MariaDB Replica Lag: s4 on db2106 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:50:04] RECOVERY - MariaDB Replica Lag: s4 on db2147 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [05:50:08] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:53:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1149 for upgrade', diff saved to https://phabricator.wikimedia.org/P15152 and previous config saved to /var/cache/conftool/dbconfig/20210406-055324-marostegui.json [05:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:45] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) I went through the upstream changelog, there are some improvements to the list import process, as well as implementing bounce processing. Given that we have to pay this upgr... [05:55:14] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Create production databases for mailman3 - https://phabricator.wikimedia.org/T278614 (10Marostegui) That's good news Amir. Thanks for testing it. Let's go for the large mailing lists import to see how it looks indeed. [06:03:10] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10MoritzMuehlenhoff) What's the timeline for the actual Mailman 3 migration? Early steps for making bullseye usable are ongoing and we'll be able to run a few machines on bullseye even... 
[06:06:19] (03CR) 10Muehlenhoff: [C: 03+2] Add postgresql-server-dev-all to package builder packages [puppet] - 10https://gerrit.wikimedia.org/r/676395 (https://phabricator.wikimedia.org/T277064) (owner: 10Muehlenhoff) [06:07:53] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/676893 (https://phabricator.wikimedia.org/T278265) (owner: 10Effie Mouzeli) [06:10:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 25%: Repool db1149', diff saved to https://phabricator.wikimedia.org/P15153 and previous config saved to /var/cache/conftool/dbconfig/20210406-061028-root.json [06:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:25] (03CR) 10Muehlenhoff: [C: 03+1] admin: add bwang and cjming to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/676893 (https://phabricator.wikimedia.org/T278265) (owner: 10Effie Mouzeli) [06:12:47] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10Legoktm) >>! In T278905#6974900, @MoritzMuehlenhoff wrote: > What's the timeline for the actual Mailman 3 migration? Probably have a real mailman3 install in a monthish, and then m... 
[06:15:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 for decommission T274752', diff saved to https://phabricator.wikimedia.org/P15154 and previous config saved to /var/cache/conftool/dbconfig/20210406-061500-marostegui.json [06:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:15:09] T274752: decommission db1076.eqiad.wmnet - https://phabricator.wikimedia.org/T274752 [06:15:16] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:16:15] (03PS1) 10Marostegui: db1076: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/677085 (https://phabricator.wikimedia.org/T274752) [06:16:34] 10SRE, 10Wikimedia-Mailing-lists: Reconsider which mailman3 version we're running - https://phabricator.wikimedia.org/T278905 (10MoritzMuehlenhoff) >>! In T278905#6974925, @Legoktm wrote: > Our transition plan is to install mailman3 on the current mailman2 host (lists1001) so we can serve both from lists.wikim... 
[06:16:45] (03CR) 10Marostegui: [C: 03+2] db1076: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/677085 (https://phabricator.wikimedia.org/T274752) (owner: 10Marostegui) [06:23:12] (03CR) 10Muehlenhoff: "Looks good, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676868 (owner: 10Jbond) [06:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 50%: Repool db1149', diff saved to https://phabricator.wikimedia.org/P15155 and previous config saved to /var/cache/conftool/dbconfig/20210406-062532-root.json [06:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:28] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10ayounsi) As discussed over IRC a while ago, this is mostly due to the network being more used in the eqia... [06:30:57] (03PS1) 10Elukey: statistics::wmde::graphite: disable wmde-toolkit-analyzer-build timer [puppet] - 10https://gerrit.wikimedia.org/r/677087 (https://phabricator.wikimedia.org/T278665) [06:32:21] (03CR) 10Elukey: [C: 03+2] statistics::wmde::graphite: disable wmde-toolkit-analyzer-build timer [puppet] - 10https://gerrit.wikimedia.org/r/677087 (https://phabricator.wikimedia.org/T278665) (owner: 10Elukey) [06:38:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 for upgrade', diff saved to https://phabricator.wikimedia.org/P15156 and previous config saved to /var/cache/conftool/dbconfig/20210406-063759-marostegui.json [06:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es1020', diff saved to https://phabricator.wikimedia.org/P15157 and previous config saved to /var/cache/conftool/dbconfig/20210406-063858-marostegui.json [06:39:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 for upgrade', diff saved to https://phabricator.wikimedia.org/P15158 and previous config saved to /var/cache/conftool/dbconfig/20210406-063938-marostegui.json [06:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 75%: Repool db1149', diff saved to https://phabricator.wikimedia.org/P15159 and previous config saved to /var/cache/conftool/dbconfig/20210406-064036-root.json [06:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:58] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:47:22] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:49:48] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [06:51:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1169 for schema change', diff saved to https://phabricator.wikimedia.org/P15160 and previous config saved to /var/cache/conftool/dbconfig/20210406-065131-marostegui.json [06:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1149 (re)pooling @ 100%: Repool db1149', diff saved to https://phabricator.wikimedia.org/P15161 and previous config saved to /var/cache/conftool/dbconfig/20210406-065539-root.json [06:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:40] 10SRE, 10serviceops: Puppet TLS certs to renew - https://phabricator.wikimedia.org/T279410 (10elukey) [07:14:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repool es1022', diff saved to https://phabricator.wikimedia.org/P15162 and previous config saved to /var/cache/conftool/dbconfig/20210406-071446-root.json [07:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:45] (03CR) 10Elukey: "n00b question about helmfile_(namespaces|rbac|psp).yaml - should those be split/reviewed as well?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [07:20:32] (03PS1) 10Marostegui: mariadb: Productionize db1184 [puppet] - 10https://gerrit.wikimedia.org/r/677097 (https://phabricator.wikimedia.org/T275633) [07:20:49] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435 [07:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:58] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [07:20:58] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [07:26:14] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) > Are the backup long TCP sessions or many small ones? I would have to prove myself wrong with... [07:29:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repool es1022', diff saved to https://phabricator.wikimedia.org/P15164 and previous config saved to /var/cache/conftool/dbconfig/20210406-072950-root.json [07:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:03] (03CR) 10Filippo Giunchedi: "Thanks for the review!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676284 (owner: 10Filippo Giunchedi) [07:33:11] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10jcrespo) If what we have is 429s (rate limiting errors), I would merge this there. Feel free to reopen if you had a different experience. 
[07:34:02] 10SRE, 10SRE-swift-storage: Problems generating thumbnails - https://phabricator.wikimedia.org/T278969 (10jcrespo) [07:34:31] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10jcrespo) [07:37:24] (03CR) 10Jcrespo: "Hey, I love the collaboration dynamic here! Keep it going." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [07:39:18] (03CR) 10Jcrespo: "recheck" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [07:44:15] (03CR) 10Jcrespo: "This is ready to be merged- executable code looks clean, and even if testing code is a bit verbose, having tests > no tests. 2 last nitpic" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: 10DharmrajRathod98) [07:44:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, though I think if possible $cluster_name should be default to undef to make it clear(er) there's no cluster_name (non blocki" [puppet] - 10https://gerrit.wikimedia.org/r/676631 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [07:44:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repool es1022', diff saved to https://phabricator.wikimedia.org/P15165 and previous config saved to /var/cache/conftool/dbconfig/20210406-074453-root.json [07:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:18] (03CR) 10Sascha: [C: 03+1] dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [07:48:14] RECOVERY - Disk space on backup2002 is OK: DISK OK 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [07:48:28] (03CR) 10Filippo Giunchedi: "LGTM, out of curiosity will it be a problem the fact that we're also provisioning the kafka::logging role at the same time as making the b" [puppet] - 10https://gerrit.wikimedia.org/r/677009 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [07:48:59] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10klausman) 05Open→03Resolved Yes, this is all done! [07:51:16] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10MoritzMuehlenhoff) >>! In T279244#6972433, @jbond wrote: > tagging @MoritzMuehlenhoff and @RobH as it seems we may need to revisit this decision Depends on the use case I guess. If there's really need... [07:51:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, though please loop in service ops folks for heads up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [07:53:29] (03CR) 10Filippo Giunchedi: [C: 03+1] add mwlog[12]002 to profile::dumps::rsync_internal_clients [puppet] - 10https://gerrit.wikimedia.org/r/676997 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [07:53:40] (03CR) 10Filippo Giunchedi: [C: 03+1] point wikimania scholarships to mwlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/676995 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [07:54:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM. 
Might be worth extracting this variable to hiera while we're at it (non-blocking)" [puppet] - 10https://gerrit.wikimedia.org/r/676996 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [07:54:54] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: set cluster name for elasticsearch outputs [puppet] - 10https://gerrit.wikimedia.org/r/676685 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [07:55:06] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: remove logstash output on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/676477 (https://phabricator.wikimedia.org/T234854) (owner: 10Cwhite) [07:55:30] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add curator config to manage w3creportingapi revision 1 indexes [puppet] - 10https://gerrit.wikimedia.org/r/676690 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [07:58:14] (03CR) 10Filippo Giunchedi: logstash: use logstash output to manage ecs-test indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [07:59:05] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: replace ECS allow list with filter_on_template [puppet] - 10https://gerrit.wikimedia.org/r/674718 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [07:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repool es1022', diff saved to https://phabricator.wikimedia.org/P15166 and previous config saved to /var/cache/conftool/dbconfig/20210406-075957-root.json [08:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:43] !log installing underscore security updates on buster [08:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:43] (03CR) 10Effie Mouzeli: [C: 03+2] admin: add bwang and cjming to ldap_only users [puppet] - 10https://gerrit.wikimedia.org/r/676893 (https://phabricator.wikimedia.org/T278265) (owner: 10Effie Mouzeli) 
[08:14:35] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Clare Ming - https://phabricator.wikimedia.org/T278265 (10jijiki) 05Open→03Resolved [08:15:14] (03PS1) 10Elukey: hadoop: refactor and simplify the Yarn Capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/677102 (https://phabricator.wikimedia.org/T277062) [08:15:16] (03PS1) 10Elukey: hadoop: fix the test cluster's yarn queue settings [puppet] - 10https://gerrit.wikimedia.org/r/677103 (https://phabricator.wikimedia.org/T277062) [08:15:17] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for Bernard Wang - https://phabricator.wikimedia.org/T279014 (10jijiki) 05Open→03Resolved a:03jijiki [08:18:09] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10fgiunchedi) a:03Papaul @papaul please replace the failed 4TB disk, led should be blinking, thank you ! [08:20:25] (03PS3) 10Marostegui: tendril: Migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:21:46] (03CR) 10Marostegui: [C: 03+2] tendril: Migrate crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/675308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:22:15] (03PS2) 10Elukey: hadoop: refactor and simplify the Yarn Capacity scheduler's settings [puppet] - 10https://gerrit.wikimedia.org/r/677102 (https://phabricator.wikimedia.org/T277062) [08:22:17] (03PS2) 10Elukey: hadoop: fix the test cluster's yarn queue settings [puppet] - 10https://gerrit.wikimedia.org/r/677103 (https://phabricator.wikimedia.org/T277062) [08:27:19] (03CR) 10Alexandros Kosiaris: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/675558 (https://phabricator.wikimedia.org/T278208) (owner: 10Elukey) [08:30:21] (03CR) 10Elukey: [C: 03+2] hadoop: refactor and simplify the Yarn Capacity scheduler's settings [puppet] - 
10https://gerrit.wikimedia.org/r/677102 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [08:30:37] (03CR) 10Elukey: [C: 03+2] hadoop: fix the test cluster's yarn queue settings [puppet] - 10https://gerrit.wikimedia.org/r/677103 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [08:32:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove weight from es4 master', diff saved to https://phabricator.wikimedia.org/P15167 and previous config saved to /var/cache/conftool/dbconfig/20210406-083248-marostegui.json [08:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:48] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:06] jan_drewniak: hey, is there a chance we can get https://gerrit.wikimedia.org/r/c/wikimedia/portals/+/668246 deployed? [08:37:21] I thought these get deployed every Monday automatically [08:39:08] Amir1: ^ that's probably because of the change? 
[08:41:24] it's very likely [08:45:50] (03Abandoned) 10David Caro: reprepro: use a different key for the k8s repo [puppet] - 10https://gerrit.wikimedia.org/r/676342 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [08:46:02] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:45] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1184 [puppet] - 10https://gerrit.wikimedia.org/r/677097 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:49:12] (03CR) 10Jbond: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676284 (owner: 10Filippo Giunchedi) [08:50:13] (03PS1) 10David Caro: aptrepo: update k8s repo key [puppet] - 10https://gerrit.wikimedia.org/r/677110 (https://phabricator.wikimedia.org/T279042) [08:59:37] (03PS1) 10Ladsgroup: tendril: Remove absented crons [puppet] - 10https://gerrit.wikimedia.org/r/677111 (https://phabricator.wikimedia.org/T273673) [09:00:44] (03PS1) 10Elukey: hadoop: set dr.who as yarn admin in the Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/677112 (https://phabricator.wikimedia.org/T277062) [09:01:08] (03CR) 10Elukey: [C: 03+2] hadoop: set dr.who as yarn admin in the Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/677112 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [09:01:39] (03CR) 10Ladsgroup: "Given that the changes have arrived at dbmonitor, I think it's safe to merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/677111 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:02:05] (03PS1) 10Effie Mouzeli: hieradata: remove parsoidJS from LVS 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:03:13] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from LVS 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:03:32] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [09:05:40] (03PS1) 10Muehlenhoff: Make bast1003 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/677116 (https://phabricator.wikimedia.org/T276399) [09:07:13] (03PS2) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:07:32] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:08:18] (03CR) 10jerkins-bot: [V: 04-1] hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) (owner: 10Effie Mouzeli) [09:14:30] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [09:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15168 and previous config saved to /var/cache/conftool/dbconfig/20210406-091818-root.json [09:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:55] (03PS3) 10Effie Mouzeli: hieradata: remove parsoidJS from production 3 [puppet] - 10https://gerrit.wikimedia.org/r/677114 (https://phabricator.wikimedia.org/T279059) [09:19:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:20:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:26] (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/677116 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [09:21:28] (03PS1) 10Effie Mouzeli: hieradata: remove parsoidJS from production 4 [puppet] - 10https://gerrit.wikimedia.org/r/677118 (https://phabricator.wikimedia.org/T279059) [09:23:27] (03PS1) 10Effie Mouzeli: hieradata: remove parsoidJS from production 5 [puppet] - 10https://gerrit.wikimedia.org/r/677119 (https://phabricator.wikimedia.org/T279059) [09:24:05] (03PS1) 10Ladsgroup: Disable legacy javascript globals on all wikis except some big ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677120 (https://phabricator.wikimedia.org/T72470) [09:25:32] (03PS3) 10Jbond: ssh::server: Support additional options for PermitRootLogin [puppet] - 10https://gerrit.wikimedia.org/r/676868 [09:26:31] (03PS1) 10Elukey: install_server: add reuse 
recipe for an-coord100* nodes [puppet] - 10https://gerrit.wikimedia.org/r/677121 (https://phabricator.wikimedia.org/T278424) [09:26:40] (03PS4) 10Jbond: ssh::server: Support additional options for PermitRootLogin [puppet] - 10https://gerrit.wikimedia.org/r/676868 [09:26:43] (03CR) 10Jbond: ssh::server: Support additional options for PermitRootLogin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676868 (owner: 10Jbond) [09:28:07] !log renew puppet cert for kraz T279410 [09:28:10] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for kraz.wikimedia.org: Renew puppet certificate - jbond@cumin1001 [09:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:17] T279410: Puppet TLS certs to renew - https://phabricator.wikimedia.org/T279410 [09:28:18] (03CR) 10Elukey: [C: 03+2] install_server: add reuse recipe for an-coord100* nodes [puppet] - 10https://gerrit.wikimedia.org/r/677121 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [09:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for kraz.wikimedia.org: Renew puppet certificate - jbond@cumin1001 [09:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:43] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for conf2002.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [09:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:13] (03PS1) 10Elukey: install_server: set Buster for an-(master|coord) nodes [puppet] - 10https://gerrit.wikimedia.org/r/677122 (https://phabricator.wikimedia.org/T278424) [09:30:37] 10SRE, 10Services, 10Service-deployment-requests, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10Chtnnh) [09:30:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert 
(exit_code=0) for conf2002.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [09:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:10] !log jbond@cumin1001 START - Cookbook sre.puppet.renew-cert for conf2003.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [09:31:13] (03CR) 10Elukey: [C: 03+2] install_server: set Buster for an-(master|coord) nodes [puppet] - 10https://gerrit.wikimedia.org/r/677122 (https://phabricator.wikimedia.org/T278424) (owner: 10Elukey) [09:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:54] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:32:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for conf2003.codfw.wmnet: Renew puppet certificate - jbond@cumin1001 [09:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15169 and previous config saved to /var/cache/conftool/dbconfig/20210406-093322-root.json [09:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:59] 10SRE, 10serviceops: Puppet TLS certs to renew - https://phabricator.wikimedia.org/T279410 (10jbond) 05Open→03Resolved a:03jbond I have now renewed all certs, thanks [09:34:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:41:28] !log Start server side upload for 4 video files 
(T279197, T279196, T279195, T279194) [09:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:43] T279197: Server side upload for Sturm - https://phabricator.wikimedia.org/T279197 [09:41:43] T279194: Server side upload for Sturm - https://phabricator.wikimedia.org/T279194 [09:41:44] T279196: Server side upload for Sturm - https://phabricator.wikimedia.org/T279196 [09:41:44] T279195: Server side upload for Sturm - https://phabricator.wikimedia.org/T279195 [09:43:10] out of curiosity, how many files are there total? [09:45:06] apergos: right now, 34 files are pending an upload. [09:45:14] but that can obv change as requests come [09:45:20] sure [09:45:22] (03PS1) 10Marostegui: install_server: Do not reimage db1184 [puppet] - 10https://gerrit.wikimedia.org/r/677123 [09:45:26] see https://phabricator.wikimedia.org/project/board/178/?filter=h4z63ti9MPIc for a list apergos [09:45:44] ah a workboard even. nice! [09:46:15] apergos: it's not only for uploads, it's for all site requests. I filtered it to tasks having "server side upload" in the title. 
[09:46:21] (03PS1) 10Jbond: cfssl: updated cfssl-refresh script to us dbconfig directly [puppet] - 10https://gerrit.wikimedia.org/r/677124 [09:46:27] 👍 [09:47:18] bookmarked (the workboard as a whole) [09:47:33] (y) [09:47:43] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/676868 (owner: 10Jbond) [09:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15170 and previous config saved to /var/cache/conftool/dbconfig/20210406-094825-root.json [09:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:12] (03CR) 10Muehlenhoff: [C: 03+2] Make bast1003 a bastion [puppet] - 10https://gerrit.wikimedia.org/r/677116 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [09:49:19] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1184 [puppet] - 10https://gerrit.wikimedia.org/r/677123 (owner: 10Marostegui) [09:49:29] !log Start server-side upload for 1 video file (T279418) [09:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:37] moritzm: ok to merge your change? [09:49:44] (03CR) 10JMeybohm: [C: 03+1] Revert "Allow RunAsAny in the restricted PSP as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/676512 (owner: 10Alexandros Kosiaris) [09:49:46] T279418: Server side upload for PantheraLeo1359531 - https://phabricator.wikimedia.org/T279418 [09:49:48] marostegui: ack, please do [09:49:57] moritzm: done! 
[09:50:00] thx [09:50:43] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE [09:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE [09:52:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:08] (03PS1) 10Muehlenhoff: wmflib: Switch spec test to bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) [09:56:41] (03PS2) 10Muehlenhoff: profile::conftool::client: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/668019 [09:58:47] (03CR) 10jerkins-bot: [V: 04-1] wmflib: Switch spec test to bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [10:02:08] (03PS2) 10Muehlenhoff: wmflib: Switch spec test to bast1003 [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) [10:03:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repool db1169', diff saved to https://phabricator.wikimedia.org/P15171 and previous config saved to /var/cache/conftool/dbconfig/20210406-100329-root.json [10:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28916/console" [puppet] - 10https://gerrit.wikimedia.org/r/677124 (owner: 10Jbond) [10:10:39] (03PS2) 10Jbond: cfssl: updated cfssl-refresh script to us dbconfig directly [puppet] - 10https://gerrit.wikimedia.org/r/677124 [10:13:35] (03PS3) 10Jbond: cfssl: updated cfssl-refresh script to us dbconfig directly [puppet] - 10https://gerrit.wikimedia.org/r/677124 [10:13:53] (03CR) 10Jbond: [C: 03+2] 
ssh::server: Support additional options for PermitRootLogin [puppet] - 10https://gerrit.wikimedia.org/r/676868 (owner: 10Jbond) [10:15:30] (03CR) 10Jbond: [C: 03+2] cfssl: updated cfssl-refresh script to us dbconfig directly [puppet] - 10https://gerrit.wikimedia.org/r/677124 (owner: 10Jbond) [10:17:08] (03PS3) 10David Caro: toolforge.checker: Update list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) [10:19:33] PROBLEM - Check systemd state on mc1034 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:12] Is it just me or people also have a favorite stat machine? [10:25:52] Amir1: i generally use stat1005 [10:26:36] (03PS4) 10David Caro: toolforge.checker: Update list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) [10:26:38] (03PS1) 10Jbond: P:pki::multirootca: ocsprefresh dose need the actuall dbconfig not just json [puppet] - 10https://gerrit.wikimedia.org/r/677216 [10:26:46] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Switch to cluster internal DNS name for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/674008 (owner: 10JMeybohm) [10:27:18] same here, it's just I don't have a particular reason for it [10:27:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28919/console" [puppet] - 10https://gerrit.wikimedia.org/r/677216 (owner: 10Jbond) [10:29:25] (03Merged) 10jenkins-bot: admin_ng: Switch to cluster internal DNS name for API [deployment-charts] - 10https://gerrit.wikimedia.org/r/674008 (owner: 10JMeybohm) [10:30:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28920/console" [puppet] - 10https://gerrit.wikimedia.org/r/674607 (https://phabricator.wikimedia.org/T269461) (owner: 
10JMeybohm) [10:31:00] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/28915/" [puppet] - 10https://gerrit.wikimedia.org/r/668019 (owner: 10Muehlenhoff) [10:31:08] PROBLEM - Check whether ferm is active by checking the default input chain on mc1034 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:31:45] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:51] Urbanecm: o/ Is there a process for adding a new private config variable? [10:32:52] (03CR) 10David Caro: toolforge.checker: Update list of etcd nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) (owner: 10David Caro) [10:33:08] ^^ Amir1 also [10:33:11] phuedx: you mean to PrivateSettings.php? No, you need to commit it there before syncing. [10:33:15] 10SRE, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): Get traffic team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273737 (10aborrero) [10:33:18] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [10:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:01] Urbanecm: Oh. Thanks. That's simple. So add the variable/commit on the deployment host. Sync? 
[10:34:04] phuedx: I assume private config is similar to a security patch [10:34:33] phuedx: add it to /srv/mediawiki-staging/private/, commit, sync [10:34:47] (03PS1) 10Muehlenhoff: Change SSH default config to use bast1003, bast1002 is going away [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677218 [10:34:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: ocsprefresh dose need the actuall dbconfig not just json [puppet] - 10https://gerrit.wikimedia.org/r/677216 (owner: 10Jbond) [10:34:57] Urbanecm, Amir1: Thanks, both [10:35:06] phuedx: note that `private` is a git repo on its own, so you need to cd there to be able to commit [10:35:24] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s users: Remove special case for migration, use list of groups [puppet] - 10https://gerrit.wikimedia.org/r/674607 (https://phabricator.wikimedia.org/T269461) (owner: 10JMeybohm) [10:35:54] Should this be done during a backport window? [10:36:42] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Change SSH default config to use bast1003, bast1002 is going away [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677218 (owner: 10Muehlenhoff) [10:36:47] phuedx: yeah, also coordinate the details with SRE, I assume [10:37:52] Amir1: why SRE? It's a normal deployment, not anything like private puppet. 
[10:39:02] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:02] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:08] RECOVERY - Check systemd state on mc1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:11] SRE and RelEng should be kept in the loop on all deployments so that if things go wrong, they can pinpoint it [10:39:11] (03PS1) 10David Caro: wmcs.toolforge: isort and black [cookbooks] - 10https://gerrit.wikimedia.org/r/677219 [10:39:13] (03PS1) 10David Caro: wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 [10:39:16] (03PS1) 10David Caro: toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 [10:39:40] for the usual stuff they can look at SAL and the deployment calendar, but this is not there [10:40:09] does what I'm saying make sense? [10:40:33] e.g. if the latency for databases grows after the deployment, we should be able to see it [10:41:42] right. 
Well, if it's just a private config variable, it can be synced with sal enabled (assuming only the variable's content needs to be private, not the fact that we even have it) [10:41:54] (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677222 [10:42:47] (03CR) 10jerkins-bot: [V: 04-1] wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 (owner: 10David Caro) [10:43:23] (03CR) 10jerkins-bot: [V: 04-1] toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 (owner: 10David Caro) [10:44:24] Amir1, Urbanecm: The config variable is part of the WikimediaEvents repo, so it's known that the variable exist. The contents of the variable does need to be private [10:44:30] *exists [10:45:06] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/677222 (owner: 10Muehlenhoff) [10:45:52] yeah, just make sure there's a SAL and mention variable itself if possible [10:46:23] phuedx: yeah, so just sync normally with a verbose message (so it's clear what you do, as you can't link gerrit patchset) [10:48:14] Great! 
Thanks again, both :) [10:52:32] (03PS3) 10Jbond: hiera cloud: update sso-grafana to match production [puppet] - 10https://gerrit.wikimedia.org/r/676885 [10:53:05] (03CR) 10Jbond: [C: 03+2] hiera cloud: update sso-grafana to match production [puppet] - 10https://gerrit.wikimedia.org/r/676885 (owner: 10Jbond) [10:55:30] !log remove wmf-laptop 0.5.0 from buster-wikimedia (incorrect import to main, next upload will land in component/wmf-sre-laptop) [10:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:04] !log upload wmf-laptop 0.5.1 to buster-wikimedia component/wmf-sre-laptop [10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:38] (03PS2) 10David Caro: wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 [10:58:40] (03PS2) 10David Caro: toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy [[Backport windows|European mid-day backport window]]
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T1100). Please do the needful. [11:00:04] Amir1: A patch you scheduled for [[Backport windows|European mid-day backport window]]
is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:01:44] Amir1: let me know if I can be helpful in any way, it looks tricky. [11:02:20] RECOVERY - Check whether ferm is active by checking the default input chain on mc1034 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:02:21] itself is not crazy, it's just a config change but hopefully it's clean [11:02:47] I did more than +10K edits to clean up the mess [11:03:16] /o\ [11:03:26] (03CR) 10Ladsgroup: [C: 03+2] Disable legacy javascript globals on all wikis except some big ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677120 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:04:12] (03Merged) 10jenkins-bot: Disable legacy javascript globals on all wikis except some big ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677120 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:06:41] https://grafana.wikimedia.org/d/000000037/mw-js-deprecate?viewPanel=7&orgId=1&refresh=1m&from=now-90d&to=now&var-Step=24h [11:07:03] The ones left are basically scripts that iterate over the global object (because why not) [11:07:09] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:677120|Disable legacy javascript globals on all wikis except some big ones]] (T72470) (duration: 01m 01s) [11:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:16] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [11:07:29] (note that it's 1:100 sample) [11:09:09] phuedx: the floor is yours [11:11:27] phuedx: which variable are you adding? just wondering if it's also needed on betacluster or not [11:12:09] Amir1: Thanks :) I wasn't actually planning on doing it today though. 
More, I was asking so I understood for later [11:12:44] Majavah: That's a good point. I'll ask members of my team if it needs to be enabled on the Beta Cluster [11:15:57] That said, I don't see a reason why it couldn't be deployed now. The change that needs it will be riding this week's train [11:16:42] 10SRE, 10Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) bast1003 is up and running; I've sent an announcement to the ops list so that people update their configs. Will open a decom task next week. [11:17:02] 10SRE, 10Patch-For-Review: migrate services from bast1002 to bast1003 - https://phabricator.wikimedia.org/T276399 (10MoritzMuehlenhoff) [11:23:16] Majavah: Yeah. I see the note about adding the config variable on the Beta Cluster. I'll ask my team and hopefully have a go at adding the variable during tomorrow's backport window /cc Amir1 [11:28:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1164 for schema change', diff saved to https://phabricator.wikimedia.org/P15172 and previous config saved to /var/cache/conftool/dbconfig/20210406-112839-marostegui.json [11:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:02] (03CR) 10JMeybohm: [C: 03+2] Revert "Allow RunAsAny in the restricted PSP as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/676512 (owner: 10Alexandros Kosiaris) [11:32:24] (03Merged) 10jenkins-bot: Revert "Allow RunAsAny in the restricted PSP as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/676512 (owner: 10Alexandros Kosiaris) [11:37:35] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:44] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:37:50] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[11:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:56] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:51] !log removed mw2247 from debmonitor T277780 [11:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:59] T277780: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 [11:45:12] (03PS2) 10ArielGlenn: pylint fixups for batches-related code [dumps] - 10https://gerrit.wikimedia.org/r/677032 (https://phabricator.wikimedia.org/T252396) [11:45:54] (03PS1) 10Jbond: P:grafana: Add ForceLogin to redirects [puppet] - 10https://gerrit.wikimedia.org/r/677226 [11:47:44] (03PS2) 10Jbond: P:grafana: Add ForceLogin to redirects [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) [11:51:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28922/console" [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [11:54:28] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [11:55:45] (03PS1) 10JMeybohm: Remove deployment_server_secrets::admin_services [labs/private] - 10https://gerrit.wikimedia.org/r/677227 (https://phabricator.wikimedia.org/T268434) [11:55:52] (03PS1) 10JMeybohm: helmfile: Remove deployment_server_secrets::admin_services [puppet] - 10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) [11:57:55] !log installing openjpeg2 security updates on buster [11:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:01] Amir1: ^ [11:59:21] that's expected [11:59:27] let's see how big it is [12:00:16] https://grafana.wikimedia.org/d/000000566/overview?viewPanel=16&orgId=1 [12:00:23] graph went weeeee [12:00:41] it's nothing serious and tbh we shouldn't be handling this [12:01:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P15173 and previous config saved to /var/cache/conftool/dbconfig/20210406-120104-root.json [12:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:42] Amir1: you just gonna let editors fix then? 
[12:03:26] I'll fix easy ones (on my global interface editor hat) [12:04:06] Nice [12:05:40] most of it is in uzwiki with "TypeError: Cannot read property 'replace' of undefined" [12:05:55] I don't think we removed "replace" [12:06:23] Weird [12:10:48] of undefined, so it's trying to replace on something that is now undefined but likely wasn't earlier, likely a global [12:11:39] oh of course [12:11:45] btw, found it and fixed it [12:11:49] https://uz.wikipedia.org/wiki/MediaWiki:Wikibugs.js [12:12:08] the reason was it was using window.wgFoo which my script wouldn't catch [12:12:47] (03PS1) 10ArielGlenn: add a first draft of documentation for the batches processing work [dumps] - 10https://gerrit.wikimedia.org/r/677234 (https://phabricator.wikimedia.org/T252396) [12:13:53] Last time a non Sysadmin touched that was 2014 [12:15:01] (03CR) 10Jbond: [V: 03+1] "Ready for review see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [12:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P15174 and previous config saved to /var/cache/conftool/dbconfig/20210406-121607-root.json [12:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:47] (03CR) 10JMeybohm: "labs/private change at https://gerrit.wikimedia.org/r/c/labs/private/+/677227" [puppet] - 10https://gerrit.wikimedia.org/r/677228 (https://phabricator.wikimedia.org/T268434) (owner: 10JMeybohm) [12:25:48] 10SRE, 10SRE-Access-Requests: Need access to noc@wikimedia.org (associated with Analytics' MaxMind account) - https://phabricator.wikimedia.org/T279310 (10ema) p:05Triage→03Medium [12:27:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you John! 
LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [12:28:32] !log installing netty security updates [12:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P15175 and previous config saved to /var/cache/conftool/dbconfig/20210406-123111-root.json [12:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 (owner: 10David Caro) [12:38:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 (owner: 10David Caro) [12:39:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.toolforge: isort and black [cookbooks] - 10https://gerrit.wikimedia.org/r/677219 (owner: 10David Caro) [12:42:15] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:45] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:34] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:47] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
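The two cookbook changes reviewed above ("use natural sorting to sort hostnames" and "remove lowest index vm if no fqdn passed") depend on each other: a plain string sort puts a host numbered 10 before one numbered 2, so "lowest index" would pick the wrong node. The sketch below only illustrates that idea and is not the code from the cookbooks repository; `natural_key` and the hostnames are made up for the example.

```python
import re


def natural_key(hostname):
    """Split a hostname into text and number chunks so numbers compare as integers."""
    # re.split with a capturing group keeps the digit runs: text, digits, text, ...
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", hostname)]


# Hypothetical hostnames, not taken from the log.
hosts = ["tools-etcd-10", "tools-etcd-9", "tools-etcd-2"]

plain_first = sorted(hosts)[0]          # lexicographic: "10" sorts before "2"
lowest = sorted(hosts, key=natural_key)[0]  # numeric-aware: 2 < 9 < 10

print(plain_first, lowest)
```

With a lexicographic sort the first entry is the host numbered 10; with the natural key it is the host numbered 2, which is what "lowest index" presumably means here.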
[12:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P15176 and previous config saved to /var/cache/conftool/dbconfig/20210406-124614-root.json [12:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:11] (03PS3) 10Jbond: P:grafana: Add ForceLogin to redirects [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) [13:06:04] (03PS3) 10David Caro: toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 [13:10:13] (03PS5) 10David Caro: toolforge.checker: Update list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) [13:13:10] (03CR) 10David Caro: [C: 03+2] wmcs.toolforge: isort and black [cookbooks] - 10https://gerrit.wikimedia.org/r/677219 (owner: 10David Caro) [13:13:17] (03CR) 10David Caro: [C: 03+2] wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 (owner: 10David Caro) [13:13:27] (03CR) 10David Caro: [C: 03+2] toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 (owner: 10David Caro) [13:16:43] (03Merged) 10jenkins-bot: wmcs.toolforge: isort and black [cookbooks] - 10https://gerrit.wikimedia.org/r/677219 (owner: 10David Caro) [13:16:45] (03Merged) 10jenkins-bot: wmcs.toolforge: use natural sorting to sort hostnames [cookbooks] - 10https://gerrit.wikimedia.org/r/677220 (owner: 10David Caro) [13:17:05] (03Merged) 10jenkins-bot: toolforge.delete_etcd_node: remove lowest index vm if no fqdn passed [cookbooks] - 10https://gerrit.wikimedia.org/r/677221 (owner: 10David Caro) [13:20:43] !log Start server-side upload for 4 video files (T279191, T279192, T279193, 
T279190) [13:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] T279190: Server side upload for Sturm - https://phabricator.wikimedia.org/T279190 [13:20:55] T279193: Server side upload for Sturm - https://phabricator.wikimedia.org/T279193 [13:20:55] T279191: Server side upload for Sturm - https://phabricator.wikimedia.org/T279191 [13:20:55] T279192: Server side upload for Sturm - https://phabricator.wikimedia.org/T279192 [13:27:30] RECOVERY - Long running screen/tmux on puppetmaster1001 is OK: OK: Tmux detected but not long running. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [13:32:34] (03CR) 10Jbond: [C: 03+1] wmflib: Switch spec test to bast1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [13:37:11] !log Retrying server-side upload for 1 file (T279192) [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:20] T279192: Server side upload for Sturm - https://phabricator.wikimedia.org/T279192 [13:44:09] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/677110 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [13:44:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1163 for schema change', diff saved to https://phabricator.wikimedia.org/P15177 and previous config saved to /var/cache/conftool/dbconfig/20210406-134418-marostegui.json [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:31] (03CR) 10David Caro: [C: 03+2] aptrepo: update k8s repo key [puppet] - 10https://gerrit.wikimedia.org/r/677110 (https://phabricator.wikimedia.org/T279042) (owner: 10David Caro) [13:47:15] (03PS3) 10Ottomata: dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) 
[13:47:26] (03CR) 10Ottomata: dumps.wikimedia.org - add section in legal specifying CC0 license for analytics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [13:47:54] (03CR) 10Herron: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/677009 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [13:56:36] (03CR) 10Muehlenhoff: wmflib: Switch spec test to bast1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677126 (https://phabricator.wikimedia.org/T276399) (owner: 10Muehlenhoff) [13:57:24] !log upgrading sretest1002 to bullseye [13:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:40] moritzm: nice and early [14:09:33] (03CR) 10Sascha: [C: 03+1] dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [14:10:22] (03CR) 10Ottomata: [C: 03+2] dumps.wikimedia.org - add section in legal specifying CC0 license for analytics [puppet] - 10https://gerrit.wikimedia.org/r/676922 (https://phabricator.wikimedia.org/T278409) (owner: 10Ottomata) [14:16:09] (03CR) 10Filippo Giunchedi: [C: 03+1] kafka-logging: migrate broker logstash1010 to kafka-logging1001 [puppet] - 10https://gerrit.wikimedia.org/r/677009 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [14:18:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [14:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:29] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a31 [vendor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/676957 (https://phabricator.wikimedia.org/T277800) [14:22:41] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [14:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:09] (03CR) 10C. Scott Ananian: [C: 03+2] "Missed the branch cut again, but sneaking this in before the train deploy." [vendor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/676957 (https://phabricator.wikimedia.org/T277800) (owner: 10C. Scott Ananian) [14:26:19] 10SRE, 10MediaWiki-extensions-Translate, 10Datacenter-Switchover, 10Performance-Team (Radar), 10Wikimedia-production-error: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (10LSobanski) Removing the #DBA... [14:29:21] !log populated thirdparty/ceph-octopus buster repo with reprepro (T274566) [14:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:29] T274566: [ceph] Test and upgrade to Octopus - https://phabricator.wikimedia.org/T274566 [14:30:49] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: upgrading openstack [14:30:53] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: upgrading openstack [14:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:58] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 97 hosts with reason: upgrading openstack [14:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:33] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
97 hosts with reason: upgrading openstack [14:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:40] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10wiki_willy) Thanks @klausman! >>! In T267670#6975131, @klausman wrote: > Yes, this is all done! [14:36:23] (03CR) 10David Caro: [C: 03+2] Horizon: put into maintenance mode for Ussuri upgrade [puppet] - 10https://gerrit.wikimedia.org/r/676847 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:37:48] (03CR) 10Cwhite: logstash: use logstash output to manage ecs-test indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [14:42:44] PROBLEM - SSH on sretest1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:45:08] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:46] (03CR) 10David Caro: [C: 03+2] Openstack eqiad1 -> Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676848 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:45:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Openstack eqiad1 -> Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/676848 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [14:46:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Repool db1163', diff saved to https://phabricator.wikimedia.org/P15178 and previous config saved to /var/cache/conftool/dbconfig/20210406-144612-root.json [14:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:32] RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring 
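The dbctl entries for db1169, db1164 and db1163 show a replica being repooled in steps (25%, 50%, 75%, 100%), with commits about fifteen minutes apart. A minimal sketch of that schedule follows, assuming the step sizes and interval are simply what the timestamps in this log suggest; `repool_schedule` is a hypothetical helper and not part of dbctl or any cookbook.

```python
from datetime import datetime, timedelta


def repool_schedule(start, steps=(25, 50, 75, 100), interval=timedelta(minutes=15)):
    """Return (time, percentage) pairs for a staged repool starting at `start`."""
    # Assumption: evenly spaced steps; the real tooling may space them differently.
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


# Start time taken from the db1164 "(re)pooling @ 25%" entry in this log.
schedule = repool_schedule(datetime(2021, 4, 6, 12, 1, 4))
for when, pct in schedule:
    print(f"{when:%H:%M:%S} pool at {pct}%")
```

The logged db1164 commits (12:01:04, 12:16:08, 12:31:11, 12:46:15) drift a few seconds past each fifteen-minute mark, so the even spacing here is only an approximation.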
[14:48:03] (03CR) 10Filippo Giunchedi: logstash: use logstash output to manage ecs-test indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [14:53:48] (03CR) 10Cwhite: logstash: use logstash output to manage ecs-test indexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/676687 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) [14:54:35] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a31 [vendor] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/676957 (https://phabricator.wikimedia.org/T277800) (owner: 10C. Scott Ananian) [14:59:33] (03PS1) 10Muehlenhoff: debian: Add an alias for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) [14:59:50] PROBLEM - Check whether ferm is active by checking the default input chain on sretest1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:01:07] 10SRE, 10ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (10Papaul) a:05Papaul→03fgiunchedi @fgiunchedi disk replaced [15:01:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repool db1163', diff saved to https://phabricator.wikimedia.org/P15179 and previous config saved to /var/cache/conftool/dbconfig/20210406-150115-root.json [15:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] PROBLEM - MegaRAID on an-worker1100 is CRITICAL: CRITICAL: 23 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, [15:02:52] s://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:03:27] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on sretest1002.eqiad.wmnet with reason: bullseye tests [15:03:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on sretest1002.eqiad.wmnet with reason: bullseye tests [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:17] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephmon2004-dev - https://phabricator.wikimedia.org/T276509 (10Papaul) [15:16:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repool db1163', diff saved to https://phabricator.wikimedia.org/P15180 and previous config saved to /var/cache/conftool/dbconfig/20210406-151619-root.json [15:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:28] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom 8 codfw appservers purchased on 2016-06-02 - https://phabricator.wikimedia.org/T277780 (10Papaul) 05Open→03Resolved complete [15:20:16] (03PS1) 10Muehlenhoff: Extend access for christinedk [puppet] - 10https://gerrit.wikimedia.org/r/677286 [15:22:47] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for christinedk [puppet] - 10https://gerrit.wikimedia.org/r/677286 (owner: 10Muehlenhoff) [15:27:05] (03PS1) 10Elukey: kerberos: set dns_canonicalize_hostname=false only where needed [puppet] - 10https://gerrit.wikimedia.org/r/677290 (https://phabricator.wikimedia.org/T278353) [15:29:02] Reedy: sbassett: around? mind if I pm? 
[15:29:19] Sure [15:29:49] (03PS2) 10Elukey: kerberos: set dns_canonicalize_hostname=false only where needed [puppet] - 10https://gerrit.wikimedia.org/r/677290 (https://phabricator.wikimedia.org/T278353) [15:31:15] Majavah: We're both in a meeting now, but yeah, we can check pms. [15:31:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repool db1163', diff saved to https://phabricator.wikimedia.org/P15182 and previous config saved to /var/cache/conftool/dbconfig/20210406-153123-root.json [15:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:33] (03CR) 10Bstorm: [C: 03+1] toolforge.checker: Update list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) (owner: 10David Caro) [15:31:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [15:32:14] (03CR) 10Krinkle: [C: 03+2] [legacy] Restore old floating style inside Vector [skins/Vector] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676955 (owner: 10Krinkle) [15:32:37] sent a PM [15:33:07] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 6 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28923/console" [puppet] - 10https://gerrit.wikimedia.org/r/677290 (https://phabricator.wikimedia.org/T278353) (owner: 10Elukey) [15:33:09] (03PS1) 10Jbond: P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/677292 [15:34:01] (03CR) 10Jbond: [C: 03+2] P:grafana: Add ForceLogin to redirects [puppet] - 10https://gerrit.wikimedia.org/r/677226 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [15:48:53] (03CR) 10David Caro: [C: 03+2] Revert "Horizon: put into maintenance mode for Ussuri upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/676849 (owner: 10Andrew Bogott) [15:49:16] !log Start 
server-side upload for 3 video files (T279189, T279188, T279183) [15:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:26] T279189: Server side upload for Sturm - https://phabricator.wikimedia.org/T279189 [15:49:26] T279188: Server side upload for Sturm - https://phabricator.wikimedia.org/T279188 [15:49:27] T279183: Server side upload for Sturm - https://phabricator.wikimedia.org/T279183 [15:50:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:52:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:05] 10SRE, 10observability, 10CAS-SSO, 10Patch-For-Review, and 2 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10jbond) With the last two patches i think i have fixed this issue can people re-test and let me know if this looks goo... [16:00:04] jbond42 and cdanis: It is that lovely time of the day again! You are hereby commanded to deploy [[Puppet request window]]
''''''. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T1600). [16:01:56] window is empty although i noticed the Deployments page has been updated and it seems to have broken the output of jouncebot [16:02:23] I think Urbanecm did some fixes for jouncebot [16:02:32] RhinosF1: ack thanks [16:02:44] hm, to me that looks like the current window on the deployment calendar is different from the surrounding ones [16:02:57] unless that’s just due to the highlighting for it being the current one [16:03:02] jbond42: it first broke it completely (ie. jouncebot didn't announce anything at all). [16:03:22] then i rewrote the logic for loading of the windows, and didn't fix this small bug yet [16:03:25] yeah ok when I remove the .deploycal-event-now class it looks like a standard wikitable again [16:03:28] (03Merged) 10jenkins-bot: [legacy] Restore old floating style inside Vector [skins/Vector] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676955 (owner: 10Krinkle) [16:03:28] Urbanecm: in that case massive improvement ;) [16:03:56] yup yup :). Better to have wikitext than nothing [16:04:01] yep :) [16:04:05] tracked in T279391 if you're curious jbond42 [16:04:06] T279391: [Regression] window name contains raw wikitext - https://phabricator.wikimedia.org/T279391 [16:04:14] thanks [16:12:12] (03PS1) 10Andrew Bogott: eqiad1 horizon -> version 'wallaby' [puppet] - 10https://gerrit.wikimedia.org/r/677298 (https://phabricator.wikimedia.org/T261138) [16:12:44] Urbanecm: it looks like jouncebot takes some html/xml-like part of the parser output and then uses .text to basically strip tags / read Node.textContent aggregate, is that correct? [16:12:51] it does this for the nicknames [16:12:59] Krinkle: do you have a scap lock on purpose or by accident? [16:12:59] I suppose that might work for the event name as well [16:13:04] andrewbogott: purpose [16:13:09] ok! [16:13:12] * andrewbogott will be patient!
[16:13:16] almost done [16:13:20] Ci was taking a while [16:13:27] (03CR) 10David Caro: [C: 03+2] toolforge.checker: Update list of etcd nodes [puppet] - 10https://gerrit.wikimedia.org/r/676409 (https://phabricator.wikimedia.org/T267082) (owner: 10David Caro) [16:15:49] confirmed on mwdebug1002, rolling out now [16:16:24] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1 horizon -> version 'wallaby' [puppet] - 10https://gerrit.wikimedia.org/r/677298 (https://phabricator.wikimedia.org/T261138) (owner: 10Andrew Bogott) [16:16:40] !log krinkle@deploy1002 Synchronized php-1.36.0-wmf.37/skins/Vector/: I3234e7712b8c1 (duration: 01m 01s) [16:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:30] Krinkle: it used to work like that. I made [[wikitech:Deployments]] to output JSON blobs in hidden divs. It works fine with {{ircnick}}, but not window names [16:18:48] so now it parses just HTML-ized output of {{ircnick}} [16:18:54] better solutions welcomed :) [16:19:22] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe200[12].codfw.wmnet - https://phabricator.wikimedia.org/T275513 (10Papaul) [16:19:40] Urbanecm: well, if we go back to reading the already-expanded parser output we can use that same technique for all the data it seems. [16:20:09] probably, but that also means that any further redesign will break it without notice :) [16:20:39] I'm not sure how it works for ircnick - does that bypass the json, or do we embed html inside json and somehow escape it? I thought that was not possible since lua can't see the parser output [16:21:48] right, but I'm not sure what's more stable in the end - a documented selector and attribute, or the current workaround we find we need to use to make it work properly. [16:22:03] I think it mainly broke because it used table element names and offsets [16:22:17] if it used the class names it would not have needed changes. 
I intentionally kept those the same [16:22:23] the Common.js code also relies on those [16:22:44] we can also move some of the information to data attributes to make it easier to consume [16:23:04] andrewbogott: all yours [16:23:12] thanks! [16:23:42] !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: Upgrade to Horizon/Wallaby [16:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] Urbanecm: oh I see. it works because the ircnick does not use any wikitext, only . so lua sees the template contents passed in and it is close enough, then it replaces quotes to make it json compliant. [16:27:35] (03CR) 10Jbond: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [16:27:48] Krinkle: possibly. [16:28:08] I'd be happy with a better solution, i just did something to unbreak it. [16:28:14] !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: Upgrade to Horizon/Wallaby (duration: 04m 32s) [16:28:17] yeah, ok [16:28:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf200[456].codfw.wmnet - https://phabricator.wikimedia.org/T275637 (10Papaul) [16:29:02] andrew@deploy1002: Failed to log message to wiki. Somebody should check the error logs.
[16:35:03] (03PS2) 10Jbond: P:debmonitor::server: move the internal server function to apache [puppet] - 10https://gerrit.wikimedia.org/r/677292 [16:38:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28927/console" [puppet] - 10https://gerrit.wikimedia.org/r/677292 (owner: 10Jbond) [16:39:43] (03CR) 10Dduvall: [C: 03+2] Branch commit for wmf/1.36.0-wmf.38 [core] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677063 (owner: 10TrainBranchBot) [16:49:47] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install moss-be200[12] - https://phabricator.wikimedia.org/T276642 (10Papaul) [16:51:37] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T275676 (10Papaul) [16:53:13] (03PS1) 10David Caro: ceph.mon: parametrize the repository to pull the packages from [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) [16:53:15] (03PS1) 10David Caro: ceph: run tests on debian 10 buster [puppet] - 10https://gerrit.wikimedia.org/r/677307 [16:54:26] (03CR) 10jerkins-bot: [V: 04-1] ceph.mon: parametrize the repository to pull the packages from [puppet] - 10https://gerrit.wikimedia.org/r/677306 (https://phabricator.wikimedia.org/T274566) (owner: 10David Caro) [17:00:04] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for [[mw:Services|Services]] – [[mw:Extension:Graph|Graphoid]] / [[ORES]]. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T1700). 
[17:01:00] 10SRE, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Epic, 10Goal: WMF media storage must be adequately backed up in a remote location - https://phabricator.wikimedia.org/T262668 (10Papaul) [17:01:22] PROBLEM - Host cp2036 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:26] PROBLEM - Host elastic2045 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:28] PROBLEM - Host ms-be2034 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:28] PROBLEM - Host ms-be2035 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:56] PROBLEM - Host elastic2046 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:05] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [17:02:26] !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: Upgrade to Horizon/Wallaby (take two) [17:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:44] PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:14] PROBLEM - Host dns2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:03:48] PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdnsrec site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:05:24] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:05:56] !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: Upgrade to Horizon/Wallaby (take two) (duration: 03m 30s) [17:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:20] (03PS2) 10David Caro: ceph.mon: parametrize the repository to pull the packages from [puppet] - 10https://gerrit.wikimedia.org/r/677306 
(https://phabricator.wikimedia.org/T274566) [17:06:22] (03PS2) 10David Caro: ceph: run tests on debian 10 buster [puppet] - 10https://gerrit.wikimedia.org/r/677307 [17:06:42] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:07:04] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:07:18] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:54] wow is anybody working on --^ [17:13:19] seems a rack down [17:13:46] yep C2 [17:14:04] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.38 [core] (wmf/1.36.0-wmf.38) - 10https://gerrit.wikimedia.org/r/677063 (owner: 10TrainBranchBot) [17:14:09] papaul: hi!!! Are you working on rack C2 by any chance? [17:14:38] if not something happened to it, I see all the hosts reported down by icinga [17:15:01] here btw [17:15:17] I am going to open task [17:15:19] *a task [17:15:24] hi chaomodus :) [17:15:30] hi :) [17:16:37] 10SRE, 10ops-codfw: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10elukey) p:05Triage→03High [17:17:04] elukey: was [17:17:47] let me check [17:17:55] papaul: hi! Thanks :) [17:18:52] the BFD alerts are related, dns2001 is in the rack [17:20:15] yes i see it, switch is up, pdu ok, some servers within that rack have no nic connection [17:20:56] (03CR) 10Muehlenhoff: debian: Add an alias for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [17:22:29] ah nice, maybe it is the switch's fault [17:22:40] elukey: maybe, looking [17:23:54] PROBLEM - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters.
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [17:24:59] elukey: configuration of cp2036 is gone from the switch [17:25:02] xe-2/0/15 cp2036 [17:25:23] same thing for dns2001 /o\ [17:25:50] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [17:27:16] papaul: it is asw-c-codfw.mgmt.codfw.wmnet right? I don't see anything in the logs though [17:27:55] elukey: yes [17:28:15] looking to see if any error on the nic itself [17:28:22] maybe nic went bad [17:31:07] elukey: no HW errors for cp2036 in idrac [17:32:08] elukey: i think the switch ports went bad, i tried connecting cp2036 to another port, i had link on that port [17:32:49] elukey: testing now to see if it is a bad DAC cable [17:33:44] ack [17:34:59] elukey: DAC cable is ok, it is the switch port [17:35:07] should be FPC 2 right? [17:35:18] it looks ok [17:35:20] yes [17:35:29] from the command line I mean [17:36:30] after disconnecting the DAC cable for the interface connected to cp2036 i still have light on the interface [17:36:44] XioNoX: around?? [17:38:21] papaul: so I can connect to cp2035, that is in the same rack in theory [17:38:29] yes [17:38:44] elukey: error: device xe-2/0/15 not found [17:38:53] that is cp2036 interface [17:38:54] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10Dzahn) I think racktables replaces netbox for Reedy's needs and he does have access to that. This ticket is down to "redirect racktables to netbox" or "add a banner telling users it's outdated" afaict. [17:38:59] and also to lvs2009, same rack [17:39:05] yes [17:39:19] elukey: yes? [17:39:21] I checked commits and didn't find anything (maybe a wrong command) [17:39:35] XioNoX: hi!
:) If you have time we have a partial rack failure in codfw, C2 [17:39:52] but not all the hosts in the rack are unreachable [17:40:06] uh [17:40:17] I didn't find any failure in logs etc.. [17:40:25] it seems as if the interfaces are not there anymore [17:40:25] 8 hosts having issues [17:40:59] hahah yeah [17:41:04] so weird [17:41:20] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [17:41:20] elukey: it probably won't come up soon, so let's plan accordingly [17:41:24] XioNoX: so far no commit was done on the switch [17:41:29] or in that row [17:41:32] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:41:48] papaul: commit or not it would not have made the interface disappear [17:41:56] XioNoX: got you [17:42:13] I see those as gone [17:42:17] xe-2/0/8 ms-be2034 [17:42:17] xe-2/0/9 ms-be2035 [17:42:17] xe-2/0/10 elastic2045 [17:42:17] xe-2/0/11 elastic2046 [17:42:17] xe-2/0/12 elastic2047 [17:42:17] xe-2/0/13 ms-be2055 [17:42:18] xe-2/0/14 dns2001 [17:42:18] xe-2/0/15 cp2036 [17:43:02] papaul: try to unplug/re-plug the optics [17:43:18] papaul: if not let's move them to a different switch port [17:43:46] lovely [17:44:08] gehel, ryankemper --^ FYI three elastic nodes are down in codfw [17:44:09] bblack: see the above about dns2001 and cp2036 ^ [17:44:14] XioNoX: did the unplug/re-plug, not good; moving to another switch port works [17:44:28] elukey: thanks, looking [17:44:55]
papaul: ok, I'll let you move all of those to a different port and update netbox/config [17:45:03] papaul: you can use the netbox script for that [17:45:26] XioNoX: ok [17:46:02] Let's make a list of importance [17:46:05] elukey: when i was just about to go get some food lol [17:46:14] papaul: argh sorryyyy :( [17:46:30] dns2001 and the cp20XX nodes should be moved first [17:46:30] sorry I'm lacking context, what's up? [17:46:43] we lose a switch or something? [17:46:44] bblack: a few switch ports died [17:46:47] ok [17:46:53] bblack: hi! Apparently some switch port failures in C2 codfw, dns2001 and two cp20xx nodes down [17:46:54] elukey: ok [17:46:57] doing that now [17:46:59] ok [17:47:19] I'll more-explicitly depool both of those, so they don't flag back up during remediation/investigation [17:47:29] perfect [17:47:41] s/flag/flap/ [17:47:48] we should probably start an incident doc too :) [17:47:51] does dns2001 need special care after the switch port relocation? [17:48:26] well, my only way to manually depool that one is via console. It is already auto-depooled from service in practice (due to loss of its BGP session). 
[17:48:58] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp2036.codfw.wmnet [17:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:11] !log cp2036 - explicitly confctl-depooled due to switch issues [17:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:58] the first logs are "Apr 6 17:00:02 asw-c-codfw fpc2 sfp-2/0/8 SFP unplugged" [17:51:13] !log dns2001 - manually disabled puppet and stopped pdns-recursor.service (and thus implicitly BIRD) to manual-depool due to switch port issues [17:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:22] (so looks like hardware failure) [17:51:49] Opened a gdoc and shared, I can take the IC role (first time, I'll try) [17:52:15] As far as the three `elastic*` nodes, the loss of the nodes has thrown us into `yellow` cluster status, which means we're not meeting our redundancy goals but we aren't experiencing any data loss. So definitely prioritize the other hosts first [17:52:16] cool thx [17:52:57] for cp2036 and dns2001 - they basically self-depooled in different ways on port loss. I was just making it explicit so they can't flap back up until we're ready for that. [17:53:24] (cp2036 was pulled from service by external healthchecks, and dns2001 was pulled from traffic flow when it became unable to advertise BGP to routers) [17:53:44] RECOVERY - Host cp2036 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [17:53:46] RECOVERY - Host dns2001 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [17:53:51] ok dns2001 and cp2036 done [17:53:58] PROBLEM - Host 2620:0:860:3:208:80:153:77 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:16] that was quick :) [17:54:24] XioNoX: thanks to you lol [17:54:40] RECOVERY - Host 2620:0:860:3:208:80:153:77 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [17:55:54] so we moved cables to other unused ports, basically?
[17:56:04] bblack: yep [17:57:44] anybody checking dns2001 now that it is up? [17:58:08] btw, it's kind of awesome that nothing functional alerted or misbehaved when we lost switch ports to 8 random hosts :) [17:58:19] elukey: it's still out-of-service manually [17:58:26] special kind of chaosmonkey :) [17:58:46] PROBLEM - Recursive DNS on 2620:0:860:3:208:80:153:77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:58:54] are we planning to remain in this state until some future hardware replacement probably on a different day? [17:59:07] or is there more risky action planned imminently? [17:59:44] bblack: ack thanks [17:59:46] bblack: not before tomorrow most likely [18:00:01] ok [18:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T1800) [18:00:14] PROBLEM - Bird Internet Routing Daemon on dns2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:00:40] I'm gonna bring dns2001 + cp2036 back online then, and we'll coordinate on the hardware replacement later? 
[18:00:41] we will follow up with Juniper, and most likely replace the switch [18:00:56] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns2001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [18:01:00] PROBLEM - Recursive DNS on 208.80.153.77 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:01:08] there are 3 ms-be hosts down, that in theory should be fine, but it would be great to double check [18:01:17] ^ all those alerts are "expected" on dns2001, because I stopped the service manually [18:01:41] elukey: I'm guessing they're still moving those cables [18:01:44] ? [18:02:09] bblack: yep yep I was trying to figure out the current impact to swift codfw [18:02:25] I assume that since the be hosts and the same rack it is "fine" if they go down at the same time [18:02:26] will wait for confirmation we're done with physical changes (port moves) [18:03:02] bblack: ah yes so papaul only reported cp2036 and dns2001 so far [18:03:26] elukey: bblack: working on the other nodes [18:03:26] papaul: when you have a moment let me know what hosts are ready to be repooled :) [18:03:44] !log andrew@deploy1002 Started deploy [horizon/deploy@392708e]: Updating Horizon to 'main' to see if that works around T279465 [18:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:52] T279465: Horizon doesn't show image list when creating new VMs - https://phabricator.wikimedia.org/T279465 [18:04:26] bblack: the question is "how likely are other parts of the switch to fail before we replace it?" and so far I'd say unlikely [18:04:43] should I quote you in the incident doc? 
:D :D [18:05:45] Murphy's law is going to hit anyway [18:05:46] :) [18:05:47] Uptime 1357 days [18:07:35] (03PS1) 10Dduvall: testwikis wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677322 [18:07:37] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677322 (owner: 10Dduvall) [18:07:46] I'm watching netbox for the port moves, which are changing there in near-realtime it appears as well, which is kinda cool [18:07:54] !log andrew@deploy1002 Finished deploy [horizon/deploy@392708e]: Updating Horizon to 'main' to see if that works around T279465 (duration: 04m 10s) [18:08:01] https://netbox.wikimedia.org/dcim/devices/127/interfaces/ <- already has some of the cable moves moved [18:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:19] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677322 (owner: 10Dduvall) [18:09:34] automation is great when it saves time :) [18:10:39] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.36.0-wmf.38 [18:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:51] looks like all of the 8 failing ports have been emptied now in netbox. I think we're still waiting for physical confirmation from humans, and/or ping recoveries, for the remaining hosts. [18:11:10] yep! 
[18:11:32] I don't see recovery in icinga yet for elastic and ms-be [18:11:40] so those are still in progress [18:11:46] RECOVERY - Host elastic2045 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [18:11:46] RECOVERY - Host ms-be2035 is UP: PING OK - Packet loss = 0%, RTA = 33.09 ms [18:11:48] RECOVERY - Host ms-be2034 is UP: PING OK - Packet loss = 0%, RTA = 33.03 ms [18:11:49] eh [18:11:51] elukey: XioNoX: all nodes should be up now [18:11:54] RECOVERY - Host elastic2046 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:11:56] papaul: <3 [18:12:00] RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms [18:12:06] RECOVERY - Host elastic2047 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [18:12:10] bblack: +1 to repool cp2036 + dns2001 if you are ok [18:12:12] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [18:13:27] elukey: XioNoX: if no one needs me i need to get out of this place i am starving [18:14:00] papaul: I think we're ok, the 8 affected hosts all have ping recovery in monitoring [18:14:06] thank you! :) [18:14:07] papaul: thanks a lot! I think that you can go, everything is recovered [18:14:11] (03CR) 10Jbond: debian: Add an alias for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [18:14:23] elukey: ack, repooling [18:14:23] bblack: elukey: ok thanks [18:14:34] papaul: lgtm! 
[18:14:51] !log dns2001 - re-enabling and running puppet agent to restore service [18:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:14] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:15:22] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns2001 is OK: OK: UP (pid=11363) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [18:15:26] RECOVERY - Recursive DNS on 208.80.153.77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:15:40] RECOVERY - Recursive DNS on 2620:0:860:3:208:80:153:77 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:15:46] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:15:58] (03PS1) 10Andrew Bogott: Added openstack::clientpackages::ussuri::stretch [puppet] - 10https://gerrit.wikimedia.org/r/677323 (https://phabricator.wikimedia.org/T261136) [18:16:16] PROBLEM - Check systemd state on elastic2047 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:52] ryankemper: es hosts up, can you check if we are green again? 
:) [18:17:04] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 66, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:17:10] RECOVERY - Bird Internet Routing Daemon on dns2001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [18:17:22] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=8&orgId=1&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops&from=now-3h&to=now-1m looks good too [18:17:28] swift seems ok [18:17:32] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:18:08] !log cp2036 - re-pooling via confctl [18:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:15] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet [18:18:17] XioNoX: Nov 18 2020 it asw-c7 going bad now asw-c2 [18:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:31] (03CR) 10Andrew Bogott: [C: 03+2] Added openstack::clientpackages::ussuri::stretch [puppet] - 10https://gerrit.wikimedia.org/r/677323 (https://phabricator.wikimedia.org/T261136) (owner: 10Andrew Bogott) [18:19:46] (03PS1) 10Jforrester: Disable LocalisationUpdate, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) [18:19:48] (03PS1) 10Jforrester: Disable LocalisationUpdate, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) [18:19:51] (03PS1) 10Jforrester: Disable LocalisationUpdate, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) [18:20:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:21] !log [urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --sleep=3600 --user=Sturm . # T278856 [18:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:29] T278856: Server side upload for Lusccasdeutsch (master task) - https://phabricator.wikimedia.org/T278856 [18:21:16] ryankemper: I see cluster green on port 9443, we should be good [18:21:36] (03CR) 1020after4: [C: 03+1] Disable LocalisationUpdate, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [18:21:55] production and omega [18:22:07] (03CR) 1020after4: [C: 03+1] Disable LocalisationUpdate, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [18:22:15] (03CR) 1020after4: [C: 03+1] Disable LocalisationUpdate, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [18:23:59] bblack, XioNoX if everything is good on your side I'd call the incident over [18:25:25] elukey: yep [18:27:03] thanks for jumping on this everyone [18:27:11] can confirm Elasticsearch is back to full health [18:30:22] 10SRE, 10ops-codfw: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10elukey) Papaul moved the affected hosts to new switch ports, connectivity restored. Follow up to Juniper to replace the switch or the failed parts. [18:30:54] XioNoX: I previously opened https://phabricator.wikimedia.org/T279457, are we going to follow up in there for the switch/parts replacement? 
[18:32:00] elukey: yep, thanks, I'll follow up with Papaul [18:32:09] super [18:32:21] all right going to log off for today, nice end of the shift :D [18:32:24] thanks to all! [18:32:43] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10RobH) >>! In T279244#6977339, @Dzahn wrote: > I think racktables is replaced by netbox for Reedy's needs and he does have access to that. This ticket is down to "redirect racktables to netbox" or "add a... [18:33:36] nice job on IC-ing elukey :) [18:34:03] bblack: thanks! :) [18:35:42] 10SRE, 10LDAP-Access-Requests, 10CAS-SSO: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (10Dzahn) Yea, it was in reference to some IRC discussion about it. Banner on racktables itself (would't have prevented this ticket because after login), or banner on CAS site (not easy to do). I think th... [18:38:49] (03CR) 10Jbond: debian: Add an alias for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/677279 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [18:41:25] (03PS1) 10Krinkle: jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 [18:41:27] (03PS1) 10Krinkle: configloader: Fix YAMLLoadWarning, use SafeLoader. 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 [18:41:29] (03PS1) 10Krinkle: Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 [18:41:31] (03PS1) 10Krinkle: deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) [18:41:53] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.36.0-wmf.38 (duration: 33m 31s) [18:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:10] (03CR) 10jerkins-bot: [V: 04-1] configloader: Fix YAMLLoadWarning, use SafeLoader. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 (owner: 10Krinkle) [18:42:18] (03CR) 10jerkins-bot: [V: 04-1] jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 (owner: 10Krinkle) [18:42:20] (03CR) 10jerkins-bot: [V: 04-1] deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) (owner: 10Krinkle) [18:42:22] (03CR) 10jerkins-bot: [V: 04-1] Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 (owner: 10Krinkle) [18:42:55] (03PS2) 10Krinkle: jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 [18:42:57] (03PS2) 10Krinkle: configloader: Fix YAMLLoadWarning, use SafeLoader. 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 [18:42:59] (03PS2) 10Krinkle: Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 [18:43:01] (03PS2) 10Krinkle: deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) [18:43:32] (03CR) 10jerkins-bot: [V: 04-1] jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 (owner: 10Krinkle) [18:43:39] (03CR) 10jerkins-bot: [V: 04-1] deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) (owner: 10Krinkle) [18:43:46] (03CR) 10jerkins-bot: [V: 04-1] configloader: Fix YAMLLoadWarning, use SafeLoader. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 (owner: 10Krinkle) [18:43:48] (03CR) 10jerkins-bot: [V: 04-1] Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 (owner: 10Krinkle) [18:45:16] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Dzahn) [18:45:51] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) 05Open→03Stalled replacement VMs wit... 
[18:46:37] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Majavah) [18:51:17] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (10Dzahn) Maybe someone could run some tests from... [18:52:30] (03PS3) 10Krinkle: jouncebot_preview: Add basic preview mode for quick testing [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677329 [18:52:32] (03PS3) 10Krinkle: configloader: Fix YAMLLoadWarning, use SafeLoader. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677330 [18:52:34] (03PS3) 10Krinkle: Let MW parse from page instead of downloading/uploading wikitext [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677331 [18:52:36] (03PS3) 10Krinkle: deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) [18:52:38] (03PS1) 10Andrew Bogott: Pontoon: update openstack version to Ussuri [puppet] - 10https://gerrit.wikimedia.org/r/677335 (https://phabricator.wikimedia.org/T261136) [18:53:26] (03CR) 10jerkins-bot: [V: 04-1] deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) (owner: 10Krinkle) [18:54:36] (03PS4) 10Krinkle: deploypage: Migrate to css selectors (fix window raw wikitext bug) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/677332 (https://phabricator.wikimedia.org/T279391) [18:58:00] (03PS1) 10Andrew Bogott: OpenStack: remove config and manifests for version 'Stein' [puppet] - 10https://gerrit.wikimedia.org/r/677337 (https://phabricator.wikimedia.org/T261136) [19:00:04] marxarelli and twentyafterfour: 
May I have your attention please! Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T1900) [19:04:44] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) boldly assigning this to @Sergey.Trofimovsky.SF now. There are 2 public IPs on the VM, gitlab1001.wikim... [19:05:05] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) a:05Dzahn→03Sergey.Trofimovsky.SF [19:06:42] (03PS1) 10Andrew Bogott: OpenStack: Add package and config for Designate/Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677340 (https://phabricator.wikimedia.org/T261137) [19:07:32] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: Add package and config for Designate/Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677340 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [19:12:56] (03PS1) 10Ottomata: Fix bug in jupyterhub-conda that was not allowing users to start new Notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/677341 (https://phabricator.wikimedia.org/T224658) [19:13:22] (03CR) 10jerkins-bot: [V: 04-1] Fix bug in jupyterhub-conda that was not allowing users to start new Notebook servers [puppet] - 10https://gerrit.wikimedia.org/r/677341 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [19:14:10] (03PS2) 10Ottomata: Fix bug in jupyterhub-conda that was keeping users from starting new Notebooks [puppet] - 10https://gerrit.wikimedia.org/r/677341 (https://phabricator.wikimedia.org/T224658) [19:15:11] (03CR) 10Ottomata: [C: 03+2] Fix bug in jupyterhub-conda that was keeping users from starting new Notebooks [puppet] - 10https://gerrit.wikimedia.org/r/677341 
(https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [19:16:33] 10SRE, 10OTRS, 10Security, 10User-notice: Migrate OTRS CE 6 to Znuny LTS fork - https://phabricator.wikimedia.org/T279303 (10Keegan) @akosiaris I assume the usual process of emails being held in queue during the migration will occur? My other question, we can assure people that if something happens this is... [19:18:57] (03PS1) 10Ottomata: Revert "Fix bug in jupyterhub-conda that was keeping users from starting new Notebooks" [puppet] - 10https://gerrit.wikimedia.org/r/676962 [19:19:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "Fix bug in jupyterhub-conda that was keeping users from starting new Notebooks" [puppet] - 10https://gerrit.wikimedia.org/r/676962 (owner: 10Ottomata) [19:24:07] (03PS1) 10Dduvall: group0 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677344 [19:24:09] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677344 (owner: 10Dduvall) [19:24:50] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.38 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677344 (owner: 10Dduvall) [19:26:47] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.38 [19:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:47] (03PS2) 10Ottomata: Revert "Fix bug in jupyterhub-conda ..." [puppet] - 10https://gerrit.wikimedia.org/r/676962 [19:30:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "Fix bug in jupyterhub-conda ..." [puppet] - 10https://gerrit.wikimedia.org/r/676962 (owner: 10Ottomata) [19:31:15] (03PS3) 10Ottomata: Revert "Fix bug in jupyterhub-conda ..." [puppet] - 10https://gerrit.wikimedia.org/r/676962 [19:32:17] (03CR) 10Ottomata: [C: 03+2] Revert "Fix bug in jupyterhub-conda ..." [puppet] - 10https://gerrit.wikimedia.org/r/676962 (owner: 10Ottomata) [19:40:16] !log 1.36.0-wmf.38 rolled to group0. 
error rates steady and no new errors spotted (T278344) [19:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:25] T278344: 1.36.0-wmf.38 deployment blockers - https://phabricator.wikimedia.org/T278344 [19:40:38] (03PS1) 10Andrew Bogott: Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) [19:40:38] now for cleanup [19:40:40] (03PS1) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [19:42:59] (03CR) 10jerkins-bot: [V: 04-1] Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) (owner: 10Andrew Bogott) [19:45:08] !log dduvall@deploy1002 Pruned MediaWiki: 1.36.0-wmf.34 (duration: 03m 37s) [19:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:35] (03CR) 10Razzi: [C: 03+2] superset: add victorops contact to superset monitoring [puppet] - 10https://gerrit.wikimedia.org/r/675898 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [19:47:37] !log dduvall@deploy1002 Pruned MediaWiki: 1.36.0-wmf.35 (duration: 02m 02s) [19:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:40] !log dduvall@deploy1002 Pruned MediaWiki: 1.36.0-wmf.36 (duration: 01m 50s) [19:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:19] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) >>! In T276148#6977653, @Dzahn wrote: > boldly assigning this to @Sergey.Trofimovsky.SF n... 
[20:14:59] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Personally I think this (the IP address) would be a good line to do the separation at. We handle the admi... [20:16:22] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10brennen) > Personally I think this (the IP address) would be a good line to do the separation at. We handle the... [20:17:02] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-psi-eqiad on cloudelastic1006 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-psi-eqiad&var-instance=cloudelastic1006&panelId=37 [20:18:52] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) Sounds like a plan to me then. 
[20:20:02] (03PS1) 10Andrew Bogott: Horizon: show up to a million projects in the project-selection menu [puppet] - 10https://gerrit.wikimedia.org/r/677358 [20:20:34] (03PS2) 10Andrew Bogott: Horizon: show up to a million projects in the project-selection menu [puppet] - 10https://gerrit.wikimedia.org/r/677358 [20:22:31] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: show up to a million projects in the project-selection menu [puppet] - 10https://gerrit.wikimedia.org/r/677358 (owner: 10Andrew Bogott) [20:28:50] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1006-cloudelastic-psi-eqiad on cloudelastic1006 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-psi-eqiad&var-instance=cloudelastic1006&panelId=37 [20:30:04] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [20:30:34] ^ any recent changes to monitoring? 
[20:32:07] "Error: Contact group 'victorops-analytics' specified in service 'superset' for host 'an-tool1005' " [20:32:10] this is the issue [20:32:34] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:32:50] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [20:33:06] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Papaul) Case Number:2021-0406-0609 create [20:34:37] razzi: hi, so this happened: <+icinga-wm> PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors [20:34:46] and then to see the actual error I ran this: [20:34:53] [alert1001:~] $ sudo icinga -v /etc/icinga/icinga.cfg [20:35:12] which told me about "Error: Contact group 'victorops-analytics' specified in service 'superset' for host " [20:35:32] alright, let me roll that back for now, which should fix that [20:35:36] that seems like that contact group does not exist yet [20:35:47] but you could also fix it by adding the group [20:35:49] in the private repo [20:36:05] up to you which way you prefer [20:36:20] you just need the group in the private repo before using it in the public repo [20:36:44] the only thing that is broken right now is "cant add new checks to Icinga" [20:36:53] but it's not like an existing thing is broken [20:37:01] unless we hard restart the service 
[20:37:09] ok whew [20:37:26] in the past puppet would do the restart [20:37:35] but we taught it to stop doing that and just alert instead [20:37:37] for that reason [20:39:11] mutante: do you know where that group would go in the private repo? [20:39:12] razzi: sorry, one part I told you is wrong actually. since this is about a contactgroup and not a contact. it is not private repo. it is both public repo [20:39:32] the group members are in private, the groups containing members are public [20:40:17] razzi: make sure a group with that name exists in modules/nagios_common/files/contactgroups.cfg [20:40:30] then run puppet on alert1001, should fix it [20:40:32] (03PS2) 10Andrew Bogott: Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) [20:40:34] (03PS2) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [20:40:37] (03PS1) 10Andrew Bogott: Keystone: get prod_ and labs_networks from a profile [puppet] - 10https://gerrit.wikimedia.org/r/677359 [20:40:39] (03PS1) 10Andrew Bogott: Removed an uneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [20:41:51] razzi: when I look at the file I see existing "victorops-fundraising" but the thing is..that is a member of a group, not a group [20:42:32] you might need to "define contactgroup" just to contain a member with the same name [20:42:55] to then be able to use that as value for a "contact_group" parameter [20:43:39] (03PS2) 10Andrew Bogott: Keystone: get prod_ and labs_networks from a profile [puppet] - 10https://gerrit.wikimedia.org/r/677359 [20:43:41] (03PS3) 10Andrew Bogott: Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) [20:43:42] mutante: let me roll back my 
change for now, then figure out how contactgroups.cfg is supposed to work [20:43:43] (03PS3) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) [20:43:45] (03PS2) 10Andrew Bogott: Removed an uneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [20:44:32] razzi: sure, that works too. so.. what is called "victorops" and "victorops-fundraising" are 'people' in the eyes of Icinga, not groups [20:45:19] the definition for the 'people' (contacts) are in private/modules/secret/secrets/nagios/contacts.cfg [20:45:46] the contactgroups are in public/modules/nagios_common/files/contactgroups.cfg [20:46:22] you first need the contact in private, then put it in a group in public, then use group in parameters [20:46:46] 10SRE, 10ops-codfw, 10netops: Multiple host down alerts from rack C2 - https://phabricator.wikimedia.org/T279457 (10Peachey88) [20:47:17] after revert you can run puppet on alert1001 and then run sudo icinga -v /etc/icinga/icinga.cfg to check for 0 errors/warnings and then the alert should recover soon after [20:48:04] you can do the same after re-adding groups/contacts later to check if it works [20:48:06] (03PS1) 10Razzi: superset: Temporarily rollback victorops-analytics contact_group [puppet] - 10https://gerrit.wikimedia.org/r/677362 (https://phabricator.wikimedia.org/T273064) [20:48:21] (03PS3) 10Andrew Bogott: Keystone: get prod_ and labs_networks from a profile [puppet] - 10https://gerrit.wikimedia.org/r/677359 [20:48:23] (03PS4) 10Andrew Bogott: Designate/Victoria: remove a hacked file [puppet] - 10https://gerrit.wikimedia.org/r/677350 (https://phabricator.wikimedia.org/T261137) [20:48:25] (03PS4) 10Andrew Bogott: Add config and manifests for Openstack version Victoria [puppet] - 10https://gerrit.wikimedia.org/r/677351 (https://phabricator.wikimedia.org/T261137) 
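Note on the contact vs. contactgroup layering described above: a minimal sketch, in Python, of a pre-merge sanity check that a contactgroup name is actually defined before a service references it. It assumes the standard Nagios/Icinga 1 object syntax (`define contactgroup { contactgroup_name ... }`). The sample config text, the `admins` group, and the helper names `defined_contactgroups` and `missing_groups` are hypothetical and only for illustration; the real file is modules/nagios_common/files/contactgroups.cfg, whose contents are not shown in this log.

```python
import re

# One "define contactgroup { ... }" block in Nagios/Icinga 1 object syntax.
BLOCK_RE = re.compile(r"define\s+contactgroup\s*\{(.*?)\}", re.DOTALL)
# The contactgroup_name directive inside such a block.
NAME_RE = re.compile(r"^\s*contactgroup_name\s+(\S+)", re.MULTILINE)


def defined_contactgroups(cfg_text):
    """Return the set of contactgroup names defined in cfg_text."""
    names = set()
    for body in BLOCK_RE.findall(cfg_text):
        match = NAME_RE.search(body)
        if match:
            names.add(match.group(1))
    return names


def missing_groups(cfg_text, wanted):
    """Return the wanted group names that cfg_text does not define, sorted."""
    defined = defined_contactgroups(cfg_text)
    return sorted(group for group in wanted if group not in defined)


# Hypothetical sample: 'victorops' is a contact (a "person" in Icinga's eyes),
# and 'admins' is a group that contains it.
SAMPLE_CFG = """
define contactgroup {
    contactgroup_name   admins
    members             victorops
}
"""

# Hypothetical fix: a one-member group named after the contact it wraps.
FIXED_CFG = SAMPLE_CFG + """
define contactgroup {
    contactgroup_name   victorops-analytics
    members             victorops-analytics
}
"""

if __name__ == "__main__":
    print(missing_groups(SAMPLE_CFG, ["admins", "victorops-analytics"]))
    print(missing_groups(FIXED_CFG, ["admins", "victorops-analytics"]))
```

This only checks that the group name exists in the text. It does not verify that the member contact exists in the private repo, so running `sudo icinga -v /etc/icinga/icinga.cfg` on the Icinga host after the puppet run remains the authoritative check.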
[20:48:27] (03PS3) 10Andrew Bogott: Removed an uneeded setting from nova.conf [puppet] - 10https://gerrit.wikimedia.org/r/677360 (https://phabricator.wikimedia.org/T261137) [20:50:02] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: get prod_ and labs_networks from a profile [puppet] - 10https://gerrit.wikimedia.org/r/677359 (owner: 10Andrew Bogott) [20:50:24] (03CR) 10Dzahn: [C: 03+1] superset: Temporarily rollback victorops-analytics contact_group [puppet] - 10https://gerrit.wikimedia.org/r/677362 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [20:52:08] (03CR) 10Razzi: [C: 03+2] superset: Temporarily rollback victorops-analytics contact_group [puppet] - 10https://gerrit.wikimedia.org/r/677362 (https://phabricator.wikimedia.org/T273064) (owner: 10Razzi) [20:54:07] 10SRE, 10serviceops, 10Sustainability: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (10Legoktm) >>! In T275752#6948295, @fgiunchedi wrote: >>>! In T275752#6935663, @Legoktm wrote: >> @fgiunchedi could you re-run your analysis to see if mw1307 (10.64.... [20:57:48] hmm mutante I merged the change and ran puppet, but I'm still seeing the error when I `sudo icinga -v /etc/icinga/icinga.cfg` [20:58:41] razzi: ah, this is due to exported resources. try running puppet on the 2 hosts you actually monitor and then again on the icinga host [20:58:49] gotcha, will try that [21:01:38] ok mutante looks like that fixed it! Thanks so much for keeping an eye on that and troubleshooting [21:01:59] razzi: cool :) np [21:02:17] so yea, i think the fix will be to add a group with the same name [21:02:22] that has only one member [21:02:52] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [21:03:08] and there is the icinga-meta check, nice ! [21:03:56] mutante: cool, I'm busy for a bit, will look into later. 
Do you know if there's a way to check that it'll work before merging? [21:05:19] 10SRE, 10OTRS, 10Security, 10User-notice: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (10Krenair) >>! In T275294#6972793, @Keegan wrote: > You might be interested in something else we may need to do in the near future, move OTRS wiki... [21:05:23] razzi: puppet compiler never hurts but it wont detect this specific case that is just internal to icinga. best you can do is.. merge..then immediately run puppet yourself and run that command from above to check it says "0 errors/warnings" and if not just revert [21:06:49] razzi: well, there is another option. disable puppet on alert*, merge, re-enable puppet on alert2001. check it works there with the same method as above, if ok, re-enable puppet on alert1001 [21:29:37] sbassett: a second? [21:29:55] RhinosF1: Sure, maybe a couple :) [21:30:42] Amir1: do you have some time for me? [21:31:03] W13: What can I do to help? [21:32:14] I believe earlier today the old js globals were removed, unfortunately it broke somewhat often used anti-vandalism tool [21:32:34] I do know where it fails, but my lack of JS prevents me to fix it :( [21:33:07] (on nlwiki, btw, forgot to add) [21:35:51] if you don't have time now I understand, then I need to figure out something else [21:48:10] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Sure, it is not urgent. But what would be mailman3? It is a better mailist? [21:54:55] (03CR) 10Krinkle: [C: 03+1] "LGTM. Might be worth conditioning on beta realm for a few days just to build up some confidence." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/677325 (https://phabricator.wikimedia.org/T158360) (owner: 10Jforrester) [21:59:23] (03PS1) 10Bartosz Dziewoński: Fix missing styles on diff [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676964 (https://phabricator.wikimedia.org/T279099) [22:02:03] (03PS1) 10Bartosz Dziewoński: Disable upcoming DiscussionTools for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677373 [22:03:14] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) Yes, for example it has less encoding issues or is easier to use and much more secure. I tried to explain it here: https://lists.wikimedia.org/pipermail/wikitech-l/2021-March/094382.html [22:05:07] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) OK let's wait. Thanks. Will be possible to create two maillists for our group? One closed to administrators and one open to the public? 
[22:08:44] (03PS2) 10Bartosz Dziewoński: Disable upcoming DiscussionTools features for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677373 [22:12:58] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [22:22:18] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [22:24:22] PROBLEM - Thanos store has high latency for series gate requests on alert1001 is CRITICAL: job=thanos-store https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [22:28:42] RECOVERY - Thanos store has high latency for series gate requests on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/e832e8f26403d95fac0ea1c59837588b/thanos-store [22:29:41] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) I can't decide on that, That's for the lists admins to decide. [22:37:50] Random wish: Can we get rid of this please :( https://meta.wikimedia.org/wiki/Www.wikibooks.org_template [22:40:50] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10felipedafonseca) Ok, and I suppose I need to talk to then latter, after de mailman3 is ready, right? 
[22:58:55] 10SRE, 10SRE-Access-Requests: Please provide deployment privilege - https://phabricator.wikimedia.org/T279489 (10HMonroy) 05Open→03Resolved [23:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) [[Backport windows|Evening backport window]]
deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210406T2300). [23:00:04] MatmaRex: A patch you scheduled for [[Backport windows|Evening backport window]]
is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:20] i can deploy today [23:00:31] Amir1: i think we have a task for it somewhere [23:00:45] MatmaRex: around? [23:01:06] hi [23:01:13] let me look [23:01:30] (03CR) 10Urbanecm: [C: 03+2] Fix missing styles on diff [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676964 (https://phabricator.wikimedia.org/T279099) (owner: 10Bartosz Dziewoński) [23:01:33] my config change is a no-op [23:01:43] (03CR) 10Urbanecm: [C: 03+2] Disable upcoming DiscussionTools features for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677373 (owner: 10Bartosz Dziewoński) [23:01:52] MatmaRex: ack. Do you want to test it on a mwdebug host? [23:02:00] can't really test anything [23:02:06] ok [23:02:22] since the code is not merged yet [23:02:31] got it [23:02:39] so it's just introducing unused variables for now [23:02:44] (03Merged) 10jenkins-bot: Disable upcoming DiscussionTools features for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677373 (owner: 10Bartosz Dziewoński) [23:02:51] yes [23:02:57] okay, I'll sync it [23:03:02] so that we can have them default to being enabled in the source code [23:03:09] which makes testing locally easier [23:03:33] it's on its way :) [23:04:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4d12a8672f0f930cec75fc2fc42abc8899d93088: Disable upcoming DiscussionTools features for now (duration: 01m 08s) [23:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:46] and done MatmaRex [23:04:51] waiting for CI now [23:05:18] thanks [23:10:16] (03PS1) 10Urbanecm: thwikisource: Enable transwiki import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677377 (https://phabricator.wikimedia.org/T275281) [23:11:35] (03CR) 10Urbanecm: [C: 03+2] thwikisource: Enable transwiki import [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/677377 (https://phabricator.wikimedia.org/T275281) (owner: 10Urbanecm) [23:12:23] (03Merged) 10jenkins-bot: thwikisource: Enable transwiki import [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677377 (https://phabricator.wikimedia.org/T275281) (owner: 10Urbanecm) [23:15:52] heads up, I'm importing a couple large mailing lists [23:16:23] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 997b6f35b544f2ae7339c4c62d303b980c307c3a: thwikisource: Enable transwiki import (T275281) (duration: 01m 08s) [23:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:31] T275281: Enable import on thwikisource - https://phabricator.wikimedia.org/T275281 [23:19:31] > sudo: time: command not found [23:19:31] you kidding me :( [23:20:00] aha, it should be the other way around [23:28:58] til it's a built in construct [23:29:32] (i'm away for a minute) [23:30:22] ack [23:31:21] (03Merged) 10jenkins-bot: Fix missing styles on diff [core] (wmf/1.36.0-wmf.37) - 10https://gerrit.wikimedia.org/r/676964 (https://phabricator.wikimedia.org/T279099) (owner: 10Bartosz Dziewoński) [23:32:05] MatmaRex: ping me when back [23:32:06] (back) [23:32:09] oh, just in time [23:32:09] good [23:33:27] MatmaRex: fetched to mwdebug1001; can you test? [23:33:39] yeah. looks good [23:33:42] syncing [23:36:05] !log urbanecm@deploy1002 Synchronized php-1.36.0-wmf.37/resources/src/: b8a0dabb1599dfd3fc08c878a5f23f7e7fce08e9: Fix missing styles on diff (T279099) (duration: 01m 08s) [23:36:12] MatmaRex: done. Anything else? 
[23:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:14] T279099: Minor edit marker no longer bolded on diffs - https://phabricator.wikimedia.org/T279099 [23:36:37] 10SRE, 10Wikimedia-Mailing-lists: Wikisul maillist - https://phabricator.wikimedia.org/T279482 (10Ladsgroup) yup [23:37:05] thanks Urbanecm [23:37:16] np [23:45:11] (03PS1) 10Papaul: DHCP Add MAC for new mw servers in A5 [puppet] - 10https://gerrit.wikimedia.org/r/677380 (https://phabricator.wikimedia.org/T274171) [23:46:24] (03CR) 10Papaul: [C: 03+2] DHCP Add MAC for new mw servers in A5 [puppet] - 10https://gerrit.wikimedia.org/r/677380 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [23:50:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2397.codfw.wmnet ` The log can be found in `/... [23:55:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install (35) mw2377 and upwards - https://phabricator.wikimedia.org/T274171 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` mw2398.codfw.wmnet ` The log can be found in `/...