[01:20:38] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 72 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:08:38] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:27:30] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:55:28] 10SRE, 10Wikimedia-Mailing-lists: Enourmous mailman3 outgoing queue - https://phabricator.wikimedia.org/T284003 (10Legoktm) Filed upstream: https://gitlab.com/mailman/mailman/-/issues/909 [03:55:57] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) Filed upstream at https://gitlab.com/mailman/mailman/-/issues/910, waiting for my other MR to be merged before submitting this new one. 
[04:03:52] PROBLEM - SSH on wdqs2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:46:07] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1019:3314 [puppet] - 10https://gerrit.wikimedia.org/r/698366 [04:46:42] ACKNOWLEDGEMENT - Disk space on dbprov2003 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=0%): Marostegui T284415 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops [04:46:58] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1019:3314 [puppet] - 10https://gerrit.wikimedia.org/r/698366 (owner: 10Marostegui) [04:48:13] !log Depool clouddb1019:3314 (long running alter table) [04:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:33] 10SRE, 10MW-on-K8s, 10serviceops: Add the puppet CA to the MediaWiki deployment - https://phabricator.wikimedia.org/T284417 (10Joe) [04:57:36] (03CR) 10Legoktm: Add qqq (032 comments) [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/685533 (owner: 10Legoktm) [04:57:38] 10SRE, 10MW-on-K8s, 10serviceops: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 (10Joe) [04:58:01] 10SRE, 10MW-on-K8s, 10serviceops: Add the puppet CA to the MediaWiki deployment - https://phabricator.wikimedia.org/T284417 (10Joe) p:05Triage→03High [04:58:17] 10SRE, 10MW-on-K8s, 10serviceops: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 
(10Joe) p:05Triage→03High [05:00:05] 10SRE, 10MW-on-K8s, 10serviceops: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (10Joe) [05:00:14] 10SRE, 10MW-on-K8s, 10serviceops: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (10Joe) p:05Triage→03High [05:02:11] 10SRE, 10MW-on-K8s, 10serviceops: Enable TLS termination on the mwdebug deployment. fix the service definition - https://phabricator.wikimedia.org/T284421 (10Joe) [05:05:03] 10SRE, 10MW-on-K8s, 10serviceops: Enable TLS termination on the mwdebug deployment. fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (10Joe) p:05Triage→03High [05:05:24] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Legoktm) I'm a bit behind on this, some other higher priority MM issues have come up, but I'm still trying to make progress on it. * Why are... [05:08:04] (03PS1) 10Marostegui: db2113: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698370 (https://phabricator.wikimedia.org/T283235) [05:08:55] (03CR) 10Marostegui: [C: 03+2] db2113: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/698370 (https://phabricator.wikimedia.org/T283235) (owner: 10Marostegui) [05:13:38] 10SRE, 10MW-on-K8s, 10serviceops: Add the puppet CA to the MediaWiki deployment - https://phabricator.wikimedia.org/T284417 (10Joe) a:03Joe [05:20:12] PROBLEM - Long running screen/tmux on an-launcher1002 is CRITICAL: CRIT: Long running SCREEN process. (user: analytics PID: 12393, 1739021s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [05:26:57] (03CR) 10Marostegui: "The candidate master in s5 codfw has been reimaged, I am going to check its tables, and tomorrow if all is fine, I will reimage codfw s5 m" [puppet] - 10https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: 10Jcrespo) [05:27:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2113.codfw.wmnet with reason: REIMAGE [05:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:29:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2113.codfw.wmnet with reason: REIMAGE [05:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:58] (03Abandoned) 10Effie Mouzeli: mwdebug: more fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/697940 (owner: 10Effie Mouzeli) [05:35:07] (03Abandoned) 10Effie Mouzeli: mwdebug: fix nutcracker pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/697915 (owner: 10Effie Mouzeli) [05:36:16] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/698371 (https://phabricator.wikimedia.org/T283235) [05:36:54] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/698371 (https://phabricator.wikimedia.org/T283235) (owner: 10Marostegui) [05:37:46] !log Depool clouddb1020 (s5, s8) for upgrade [05:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:05] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1020" [puppet] - 10https://gerrit.wikimedia.org/r/698353 [05:39:49] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1020" [puppet] - 10https://gerrit.wikimedia.org/r/698353 (owner: 10Marostegui) [05:40:19] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::alerts fix panelId for mediawiki exceptions alert [puppet] 
- 10https://gerrit.wikimedia.org/r/690540 (https://phabricator.wikimedia.org/T284301) (owner: 10Effie Mouzeli) [05:45:05] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.143e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [05:57:25] !log Stop dbstore1004 to clone dbstore1007 T283125 [05:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:30] T283125: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 [06:05:55] !log Upgrade mysql on dbstore1003 T283235 [06:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:01] T283235: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 [06:07:37] (03PS2) 10ArielGlenn: dumps: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/698309 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [06:11:34] (03CR) 10Elukey: "As described in https://phabricator.wikimedia.org/T280661, cert-manager should not be needed so we can easily skip this patch." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) (owner: 10Elukey) [06:11:43] (03CR) 10ArielGlenn: [C: 03+2] dumps: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/698309 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [06:20:40] (03CR) 10Marostegui: "@razzi let's try to get this merged today after fixing Luca's point? The transfer is going and I expect it to be done by tomorrow." [homer/public] - 10https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [06:22:54] marostegui: not blocked by --^ right? 
[06:23:00] otherwise I can take care of it [06:27:04] (03CR) 10ArielGlenn: [C: 03+1] "Looks like a no-op, PCC says it's a no-op, good enough for me!" [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [06:28:19] elukey: no, the transfer will take a day [06:28:23] so not blocked yet :) [06:29:08] ack, ping me in case! [06:29:59] elukey: thanks <3 [06:30:48] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:08] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [06:35:11] (03CR) 10ArielGlenn: [C: 03+1] "Yep this is fine." [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [06:42:08] the latency is not really good [06:46:17] <_joe_> yes [06:46:23] <_joe_> it's just on POST though? [06:46:27] appserver only, POST only yes [06:46:30] <_joe_> yep [06:46:38] <_joe_> it looks a lot like some timeout [06:46:46] <_joe_> marostegui: any trouble on the databases? 
[06:48:25] <_joe_> elukey: so the p75 didn't increase [06:48:35] <_joe_> but the p95 is now around 40 seconds [06:49:30] there is also an increase in traffic towards memcached shards afaics [06:49:55] following the same timing [06:50:05] <_joe_> I see some replicas are lagged [06:50:39] <_joe_> well no they were 6 minutes ago [06:51:13] <_joe_> elukey: now it's matter of finding if it's a specific type of request that takes this long [06:51:58] I was about to say the same, can't see any clear problem in db metrics [06:53:18] <_joe_> it's all stuff like [06:53:22] <_joe_> http://en.wikipedia.org/w/index.php?title=Special:Export&pages=The_Beatles_&offset=2017-11-18T00:05:50Z&action=submit [06:54:01] <_joe_> see https://logstash.wikimedia.org/goto/21492c9774d85cedee69dd51ed985784 [06:54:31] <_joe_> it's a single bot [06:56:42] yep good catch [06:59:20] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [07:02:43] <_joe_> the bot owner seems to read us, they stopped? 
[07:04:43] maybe they realized that the latency was terrible and wondered what's up [07:05:46] RECOVERY - SSH on wdqs2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:08:11] (03PS2) 10ZPapierski: Push the limit for shads queried in relforge [puppet] - 10https://gerrit.wikimedia.org/r/688309 [07:08:13] (03PS1) 10ZPapierski: Enable blank node skolemization for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/698451 (https://phabricator.wikimedia.org/T284040) [07:13:42] (03CR) 10DCausse: [C: 03+1] Enable blank node skolemization for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/698451 (https://phabricator.wikimedia.org/T284040) (owner: 10ZPapierski) [07:28:17] (03CR) 10Muehlenhoff: [C: 03+2] htmldumps: Switch to common profile::nginx [puppet] - 10https://gerrit.wikimedia.org/r/698207 (https://phabricator.wikimedia.org/T163356) (owner: 10Muehlenhoff) [07:29:24] (03PS1) 10Effie Mouzeli: mediawiki: fix ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) [07:30:05] (03PS3) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) [07:31:30] (03CR) 10Muehlenhoff: [C: 03+2] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698215 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [07:35:32] (03PS1) 10Muehlenhoff: Revert "Switch htmldumps to nginx-light" [puppet] - 10https://gerrit.wikimedia.org/r/698455 [07:37:23] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Switch htmldumps to nginx-light" [puppet] - 10https://gerrit.wikimedia.org/r/698455 (owner: 10Muehlenhoff) [07:37:28] PROBLEM - Check systemd state on htmldumper1001 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:15] RECOVERY - Check systemd state on 
htmldumper1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: add ca-bundle to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) [07:44:59] (03PS2) 10ZPapierski: Enable blank node skolemization for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/698451 (https://phabricator.wikimedia.org/T284040) [07:47:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, consider removing the useless values (see comments) as optional" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) (owner: 10Effie Mouzeli) [07:48:31] (03CR) 10Jcrespo: [C: 03+2] [T284399] Perform first commit on operations/bernard repository, add .gitignore and README.md [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:49:29] (03PS2) 10Giuseppe Lavagetto: mediawiki: add ca-bundle to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) [07:50:34] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: add ca-bundle to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [07:52:59] (03CR) 10Jcrespo: [C: 03+2] "For wikimedia style, no need for brackets on the subject- we mark the bug at the end like you did. 
We don't do [bug] or other things like " [software/bernard] - 10https://gerrit.wikimedia.org/r/698327 (https://phabricator.wikimedia.org/T284399) (owner: 10H.krishna123) [07:53:03] (03PS1) 10Muehlenhoff: htmldumps: Fully remove rate limit settings [puppet] - 10https://gerrit.wikimedia.org/r/698457 [07:53:05] (03PS1) 10Muehlenhoff: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698458 (https://phabricator.wikimedia.org/T164456) [07:53:41] (03PS1) 10Filippo Giunchedi: alertmanager: print link separators on IRC when needed [puppet] - 10https://gerrit.wikimedia.org/r/698459 (https://phabricator.wikimedia.org/T282806) [07:53:49] (03CR) 10jerkins-bot: [V: 04-1] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698458 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [07:55:13] wut [07:58:33] (03PS2) 10Effie Mouzeli: mediawiki: fix ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) [08:00:36] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) The discovery has worked as expected now and data is collected, I'm seeing only one input for volts/hertz/current as opposed to individual phases, an... [08:01:05] (03CR) 10JMeybohm: "All/Most of the other places to call the configmap-key "puppetca.crt.pem". You think it would make sense to use that here as well?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [08:01:32] (03PS3) 10Giuseppe Lavagetto: mediawiki: add ca-bundle to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) [08:02:55] (03CR) 10Gehel: [C: 03+2] Enable blank node skolemization for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/698451 (https://phabricator.wikimedia.org/T284040) (owner: 10ZPapierski) [08:07:29] 10SRE, 10docker-pkg, 10serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Joe) [08:07:42] 10SRE, 10docker-pkg, 10serviceops: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (10Joe) p:05Triage→03Medium [08:09:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: fix ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) (owner: 10Effie Mouzeli) [08:15:16] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: fix ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) (owner: 10Effie Mouzeli) [08:26:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [08:29:15] (03CR) 10Muehlenhoff: [C: 03+1] "Good catch, confirmed with "ldapsearch -x uidNumber=10972"." [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [08:29:30] (03CR) 10Jcrespo: "Question- this seems to avoid excluding backups- but backups of mm2 still happen regularly? I suppose that is planned? 
If shutdown (and no" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [08:31:40] RECOVERY - Disk space on dbprov2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbprov2003&var-datasource=codfw+prometheus/ops [08:44:15] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:45:45] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [08:51:25] (03CR) 10Ladsgroup: "oh I need to say that we have deleted a lot (83GB) from that directory so creating backups would be much much faster and smaller. I will c" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [08:59:18] (03PS1) 10Jbond: R:systemd::timer::job: use the syslog_identier for the syslog programname [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) [09:01:12] (03CR) 10jerkins-bot: [V: 04-1] R:systemd::timer::job: use the syslog_identier for the syslog programname [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [09:02:42] (03PS1) 10Ayounsi: Add Python 3.9 support [software/homer] - 10https://gerrit.wikimedia.org/r/698463 [09:03:22] (03CR) 10jerkins-bot: [V: 04-1] Add Python 3.9 support [software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [09:03:49] (03PS2) 10Jbond: R:systemd::timer::job: use the syslog_identier for the syslog programname [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) [09:03:53] !log installing imagemagick security updates on stretch [09:03:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:48] (03PS16) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [09:04:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29803/console" [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [09:05:48] (03PS2) 10Ayounsi: Add Python 3.9 support [software/homer] - 10https://gerrit.wikimedia.org/r/698463 [09:06:36] (03PS3) 10Ayounsi: Add Python 3.9 support [software/homer] - 10https://gerrit.wikimedia.org/r/698463 [09:06:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29804/console" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:10:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (10Cervisiarius) Thanks all! I checked, and I can now use Jupyter Hub. I saw that the Kerberos issue is being handled in task https://phabricator.wikimedia.org/T284022, so I'll keep an... 
[09:10:22] (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [09:11:44] (03CR) 10Jbond: [V: 03+1] "thanks fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [09:11:52] (03PS9) 10Jbond: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 [09:12:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29805/console" [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [09:12:57] (03CR) 10jerkins-bot: [V: 04-1] mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [09:16:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:17:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [09:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [09:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:49] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, but I'm not super familiar with timer::job" [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [09:22:29] (03PS1) 10Jbond: P:openldap::management: Ignore errors ffor account consistency check [puppet] - 10https://gerrit.wikimedia.org/r/698464 [09:23:02] (03CR) 10Ayounsi: "Note that I'm currently hitting https://github.com/PyCQA/prospector/issues/418 on my laptop." 
[software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [09:23:28] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:23:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29806/console" [puppet] - 10https://gerrit.wikimedia.org/r/698464 (owner: 10Jbond) [09:25:02] (03CR) 10Jcrespo: "> Does that answer your question?" [puppet] - 10https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [09:25:12] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 34 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:28:55] (03CR) 10Jcrespo: "The patch is technically correct (it will configure remote backups), I am just waiting on confirmation that the backup strategy will work " [puppet] - 10https://gerrit.wikimedia.org/r/697850 (https://phabricator.wikimedia.org/T274463) (owner: 10Dzahn) [09:35:51] (03CR) 10MSantos: [C: 03+1] Rename maps-vector-server to tegola-vector-tiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/693917 (https://phabricator.wikimedia.org/T283159) (owner: 10Effie Mouzeli) [09:36:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/698464 (owner: 10Jbond) [09:36:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [09:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:openldap::management: Ignore errors ffor account consistency 
check [puppet] - 10https://gerrit.wikimedia.org/r/698464 (owner: 10Jbond) [09:37:36] (03CR) 10ArielGlenn: "Go to it!" [puppet] - 10https://gerrit.wikimedia.org/r/698457 (owner: 10Muehlenhoff) [09:39:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [09:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:15] (03PS2) 10ArielGlenn: Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698458 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:42:15] (03CR) 10ArielGlenn: [C: 03+1] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698458 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [09:43:48] !log upgrading bullseye hosts to latest packages in testing [09:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:09] 10SRE, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog: tegola-vector-tiles load testing and Swift throughput experiments - https://phabricator.wikimedia.org/T284440 (10MSantos) [09:48:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [09:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:10] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:51:47] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-06-04 09:18:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:52:35] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 23 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:52:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [09:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:01] PROBLEM - SSH on mw1284.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:15] 10SRE, 10SRE-tools, 10User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10Volans) [09:55:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host theemin.codfw.wmnet [09:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:55] 10SRE, 10SRE-tools, 10User-jbond: Rootless cookbooks/spicerack - https://phabricator.wikimedia.org/T284302 (10Volans) I'm not totally sure about the `init_rootless()` part of the API, but we can refine it as we implement it depending on the use cases. [09:57:29] 10Puppet, 10SRE-OnFire, 10User-jbond: Create SRE checklist for puppet - https://phabricator.wikimedia.org/T284073 (10jbond) 05Open→03Stalled First draft of this has been sent to the shared gdoc, awaiting review [10:03:00] (03CR) 10Volans: [C: 03+1] "LGTM, one comment inline" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [10:05:39] (03PS1) 10Kormat: db1157: Disable notificcations. [puppet] - 10https://gerrit.wikimedia.org/r/698469 (https://phabricator.wikimedia.org/T283131) [10:07:39] (03PS1) 10MSantos: Trigger tegola latest build [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 [10:07:53] (03PS2) 10Kormat: db1157: Disable notifications. 
[puppet] - 10https://gerrit.wikimedia.org/r/698469 (https://phabricator.wikimedia.org/T283131) [10:08:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1157 depooling: reimage to buster T283131', diff saved to https://phabricator.wikimedia.org/P16311 and previous config saved to /var/cache/conftool/dbconfig/20210607-100822-kormat.json [10:08:25] (03CR) 10Jcrespo: "This is technically correct, but I'd prefer to use the modern workflow if possible. See comment on phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/698251 (https://phabricator.wikimedia.org/T284157) (owner: 10Bstorm) [10:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:27] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [10:08:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host theemin.codfw.wmnet [10:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:19] (03CR) 10Kormat: [C: 03+2] db1157: Disable notifications. 
[puppet] - 10https://gerrit.wikimedia.org/r/698469 (https://phabricator.wikimedia.org/T283131) (owner: 10Kormat) [10:10:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org [10:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:02] (03CR) 10Muehlenhoff: [C: 03+2] htmldumps: Fully remove rate limit settings [puppet] - 10https://gerrit.wikimedia.org/r/698457 (owner: 10Muehlenhoff) [10:11:11] (03CR) 10MSantos: [C: 03+2] Trigger tegola latest build [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [10:11:32] (03PS1) 10Kormat: install_server: switch db1157 to buster [puppet] - 10https://gerrit.wikimedia.org/r/698471 (https://phabricator.wikimedia.org/T283131) [10:12:18] (03Merged) 10jenkins-bot: Trigger tegola latest build [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [10:12:25] (03CR) 10Ayounsi: Add Python 3.9 support (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [10:12:38] (03CR) 10Kormat: [C: 03+2] install_server: switch db1157 to buster [puppet] - 10https://gerrit.wikimedia.org/r/698471 (https://phabricator.wikimedia.org/T283131) (owner: 10Kormat) [10:12:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:13:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org [10:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:14:42] (03CR) 10MSantos: "recheck" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [10:15:23] (03CR) 10Volans: [C: 03+1] "reply inline" (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/698463 (owner: 10Ayounsi) [10:17:26] (03CR) 10MSantos: "republish" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [10:18:41] (03CR) 10MSantos: "rebuild" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [10:19:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org [10:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:04] (03CR) 10Muehlenhoff: [C: 03+2] Switch htmldumps to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698458 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [10:21:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org [10:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:39] (03PS3) 10Effie Mouzeli: mediawiki: fix ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) [10:23:05] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2065 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:23:33] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (grafana2001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:23:42] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698472 (https://phabricator.wikimedia.org/T128546) 
[10:24:15] !log remove now obsolete nginx mods and dependencies on htmldumper1001 T164456 [10:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:19] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [10:24:51] 10SRE, 10LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (10conny-kawohl_WMDE) As an EM at WMDe I approve that @dang is one of the engineers in our team! [10:25:40] (03CR) 10Effie Mouzeli: "There are a few nits here and there, generally we believe that we are getting closer 😊" (0318 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [10:25:43] volans: huh. wmf-auto-reimage didn't log here that i was running it. is that a known issue? [10:26:09] (03PS4) 10Effie Mouzeli: mediawiki: fix ports and enable TLS on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) [10:28:57] !log reimaging db1157 T283131 [10:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] T283131: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 [10:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T1030) [10:30:41] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 302 Found - 435 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:30:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org [10:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:12] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki: fix ports and enable TLS on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) (owner: 10Effie Mouzeli) [10:31:19] kormat: the reimage script has never logged into SAL :) [10:31:27] writes to the phab task [10:32:17] volans: oh really? ok. my imagination exceeded reality again. [10:32:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [10:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:27] you get the ! log of the downtime though [10:33:29] :) [10:33:37] with the reason REIMAGE [10:33:41] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Enable TLS termination on the mwdebug deployment. 
fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (10jijiki) I have enabled TLS on staging for now, which will use some default certs [10:34:21] (03Merged) 10jenkins-bot: mediawiki: fix ports and enable TLS on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/698454 (https://phabricator.wikimedia.org/T284421) (owner: 10Effie Mouzeli) [10:34:37] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [10:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:43] ^^^ [10:34:51] volans: ah hah [10:34:59] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2065 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:35:00] perfect timing [10:35:20] godog: FYI grafana2001 ^^^ [10:35:37] thanks volans [10:36:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org [10:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:29] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 96469 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:36:43] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: REIMAGE [10:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:42] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698472 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:07] !log downgrade grafana to 7.4.2 on grafana2001 - T282863 [10:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:10] T282863: Upgrade Grafana to 8 - https://phabricator.wikimedia.org/T282863 
[10:38:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org [10:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:35] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698472 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:05] 10SRE, 10LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (10karapayneWMDE) Email sent to confirm email [10:41:29] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:698472| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:33] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:42:26] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:698472| Bumping portals to master (T128546)]] (duration: 00m 56s) [10:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:46] !log reset netbox-next DB with the latest prod dump [10:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:49] PROBLEM - grafana.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2065 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [10:51:52] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-06-04 10:29:30 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:52:06] (03PS3) 10Ayounsi: Manage analytics-in4/6 with Capirca [homer/public] - 10https://gerrit.wikimedia.org/r/698202 
(https://phabricator.wikimedia.org/T279429) [10:53:19] ACKNOWLEDGEMENT - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-06-04 09:18:11 Jcrespo T284415 - The acknowledgement expires at: 2021-06-08 10:52:47. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:53:19] ACKNOWLEDGEMENT - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2021-06-03 12:39:53 Jcrespo T284415 - The acknowledgement expires at: 2021-06-08 10:52:47. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:53:19] ACKNOWLEDGEMENT - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-06-04 10:29:30 Jcrespo T284415 - The acknowledgement expires at: 2021-06-08 10:52:47. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:53:19] ACKNOWLEDGEMENT - snapshot of x1 in codfw on alert1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2021-06-03 02:23:55 Jcrespo T284415 - The acknowledgement expires at: 2021-06-08 10:52:47. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:53:56] RECOVERY - SSH on mw1284.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:54:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 77 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:54:32] (03PS1) 10Jelto: Add user jelto and add user to ops_members [puppet] - 10https://gerrit.wikimedia.org/r/698477 [10:54:56] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Nikerabbit) >>! In T282022#7137308, @Legoktm wrote: > I'm a bit behind on this, some other higher priority MM issues have come up, but I'm sti... [10:59:08] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a European mid-day backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:08] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:10] o/ [11:00:18] Lucas_WMDE can I add one? [11:00:23] sure! 
[11:00:39] (03PS3) 10DannyS712: Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) [11:00:44] (03PS1) 10Muehlenhoff: Add Kerberos principals for DC ops [puppet] - 10https://gerrit.wikimedia.org/r/698480 (https://phabricator.wikimedia.org/T279721) [11:01:01] ^ didn't think I would be around for the window (I really should go to sleep) but since I am and I missed the window last time [11:01:54] Lucas_WMDE I assume you are willing to deploy, right? [11:01:58] yeah! [11:02:30] thanks [11:02:36] looking at the change now [11:02:51] I've added it to the deployments page too [11:04:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Namespace IDs not used for anything else AFAICT, should be good to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [11:05:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [11:05:48] (03Merged) 10jenkins-bot: Add 2021 namespaces for wikimania wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697824 (https://phabricator.wikimedia.org/T284235) (owner: 10DannyS712) [11:06:03] let me know when to test [11:06:10] pulled to mwdebug1001 [11:06:26] seems to work in https://wikimania.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces [11:06:49] https://wikimania.wikimedia.org/w/index.php?title=2021:Test&action=info looks fine too [11:07:27] yup, looks to be working [11:07:31] ok, syncing [11:08:30] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:697824|Add 2021 namespaces for wikimania wiki (T284235)]] (duration: 00m 56s) [11:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:36] T284235: On Wikimania
wiki, create a namespace for 2021 - https://phabricator.wikimedia.org/T284235 [11:08:54] anything else to deploy? [11:09:28] yay, it worked! Thanks - nothing else from me [11:09:33] ok! :) [11:09:37] !log EU backport+config window done [11:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:16] mwdebug logstash has some LogicExceptions but I assume those are unrelated [11:10:38] {{checking}} [11:13:53] the errors about "PHP Notice: Undefined index: subpages" were related but should be fine now, caused by the page switching from a namespace with subpages to one without I think. I don't see LogicExceptions ? [11:14:31] “LogicException: Process cache for 'en' should be set by now.” [11:14:57] from MessageCache.php [11:15:03] oh, I was looking at the mediawiki-errors dashboard - yeah, that looks unrelated [11:15:42] caused by me loading a bunch of user scripts on meta I think [11:16:07] ok [11:16:48] also, while I'm here, did you see my explanation at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/695422 ? 
[11:24:22] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:26:38] (03CR) 10Muehlenhoff: [C: 03+2] Add Kerberos principals for DC ops [puppet] - 10https://gerrit.wikimedia.org/r/698480 (https://phabricator.wikimedia.org/T279721) (owner: 10Muehlenhoff) [11:33:24] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:34:51] (03CR) 10MSantos: "recheck publish" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [11:34:53] DannyS712: yes [11:35:07] but I don’t have anything to add to it, I think [11:35:57] okay, I guess I'll just have to wait and see what the wikibase team decides about supporting 1.36 - I can also try to investigate and see if there is a way to still support Revisions without that class existing in core [11:53:25] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10MoritzMuehlenhoff) 05Open→03Resolved >>! In T279721#7127346, @Cmjohnson wrote: > @MoritzMuehlenhoff Do we need to keep this task open any longer? We can close it. [11:55:05] (03CR) 10Hashar: "I have manually retriggered the postmerge jobs (which requires access on the CI server) using:" [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [11:56:09] (03CR) 10MSantos: "Thanks @Hashar!"
[software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/698470 (owner: 10MSantos) [11:56:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [11:57:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] R:systemd::timer::job: use the syslog_identier for the syslog programname [puppet] - 10https://gerrit.wikimedia.org/r/698462 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [12:00:30] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:33] (03PS2) 10Hashar: gerrit: remove Java 8 packages [puppet] - 10https://gerrit.wikimedia.org/r/696591 (https://phabricator.wikimedia.org/T268225) [12:11:37] (03CR) 10Filippo Giunchedi: [C: 03+2] rsync: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/698308 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [12:11:39] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: Drop absented cron [puppet] - 10https://gerrit.wikimedia.org/r/698307 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [12:16:52] (03PS1) 10Muehlenhoff: role::dumps::distribution::server: Switch to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/698485 [12:18:09] (03CR) 10Ema: prometheus: Add dependency between varnish exporter and varnish service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [12:18:49] (03CR) 10Muehlenhoff: [C: 03+2] gerrit: remove Java 8 packages [puppet] - 10https://gerrit.wikimedia.org/r/696591 (https://phabricator.wikimedia.org/T268225) (owner: 10Hashar) [12:19:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698485 (owner: 10Muehlenhoff) [12:22:03] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=wikimaniawiki # T284442 
[12:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:07] T284442: Retrieve existing pages created for 2021 Wikimania before creation of 2021 namespace - https://phabricator.wikimedia.org/T284442 [12:22:37] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=wikimaniawiki --add-prefix=BROKEN --fix # T284442 [12:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:05] Lucas_WMDE: ^^ you forgot to run this script, and broke all existing pages starting with 2021: 🙂 [12:24:10] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2098.codfw.wmnet:3318) taken on 2021-06-07 10:46:02 (1252 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [12:24:20] oh right, damn [12:24:33] just did it, should work now [12:24:36] thanks [12:24:40] np [12:25:01] !log installing nginx security updates on buster [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:40] (03PS2) 10Jcrespo: dbbackups: Switchover s3 backup source from db1171 to db1102 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/692845 (https://phabricator.wikimedia.org/T283131) [12:31:38] (03CR) 10MMandere: "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [12:32:11] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox [12:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:20] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:35:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:28] !log removing now obsolete Java 8 packages from contint* T268225 [12:36:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:32] T268225: Switch Gerrit from Java 8 to Java 11 - https://phabricator.wikimedia.org/T268225 [12:41:30] !log removing now obsolete Java 8 packages from gerrit* T268225 [12:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:34] T268225: Switch Gerrit from Java 8 to Java 11 - https://phabricator.wikimedia.org/T268225 [12:49:06] RECOVERY - grafana.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 68413 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [12:50:50] (03PS1) 10Urbanecm: Make it possible to deploy welcomesurvey to % of users that's not divisible by 10 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698364 (https://phabricator.wikimedia.org/T284127) [12:51:09] (03PS1) 10Urbanecm: Align welcome survey group with homepage group [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698365 (https://phabricator.wikimedia.org/T284257) [12:51:26] 10SRE, 10MW-on-K8s, 10serviceops: Add conditional to mediawiki-config for stuff running on kubernetes - https://phabricator.wikimedia.org/T284418 (10Reedy) [12:51:28] (03PS2) 10Urbanecm: Align welcome survey group with homepage group [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698365 (https://phabricator.wikimedia.org/T284257) [12:56:16] (03CR) 10JMeybohm: [C: 03+1] Add user jelto [puppet] - 10https://gerrit.wikimedia.org/r/698477 (owner: 10Jelto) [12:58:19] (03PS2) 10Ottomata: admin: amend west1 uid to uidNumber from ldap [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [12:58:43] (03CR) 10JMeybohm: [C: 03+1] "Key was validated over the phone" [puppet] - 10https://gerrit.wikimedia.org/r/698477 (owner: 10Jelto) [12:59:02] (03PS1) 10Ema: alertmanager: define IRC and page routes for sre team [puppet] - 
10https://gerrit.wikimedia.org/r/698491 (https://phabricator.wikimedia.org/T273716) [12:59:36] (03CR) 10Ottomata: [C: 03+2] admin: amend west1 uid to uidNumber from ldap [puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [13:02:16] (03PS2) 10Ema: alertmanager: define IRC and page routes for sre team [puppet] - 10https://gerrit.wikimedia.org/r/698491 (https://phabricator.wikimedia.org/T273716) [13:03:56] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: define IRC and page routes for sre team [puppet] - 10https://gerrit.wikimedia.org/r/698491 (https://phabricator.wikimedia.org/T273716) (owner: 10Ema) [13:05:18] (03CR) 10JMeybohm: [C: 03+2] Add user jelto [puppet] - 10https://gerrit.wikimedia.org/r/698477 (owner: 10Jelto) [13:05:29] (03PS3) 10JMeybohm: Add user jelto [puppet] - 10https://gerrit.wikimedia.org/r/698477 (owner: 10Jelto) [13:06:32] (03PS2) 10MMandere: prometheus: Add dependency between varnish exporter and varnish service [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) [13:07:34] (03CR) 10Ottomata: "Had to stop jupyter-west1-singleuser on stat1006." 
[puppet] - 10https://gerrit.wikimedia.org/r/698205 (https://phabricator.wikimedia.org/T284136) (owner: 10Cwhite) [13:07:58] (03CR) 10Filippo Giunchedi: Netops team alert: ping offload (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [13:12:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:50] (03CR) 10Jcrespo: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) (owner: 10Jcrespo) [13:23:50] (03PS1) 10Ssingh: site: add wikidough eqiad with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698505 [13:24:38] (03PS2) 10Ssingh: site: add wikidough eqiad with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/698505 (https://phabricator.wikimedia.org/T284348) [13:27:35] (03PS1) 10Jcrespo: mariadb: Temporarilly reduce retention of dbprov2003-stored backups [puppet] - 10https://gerrit.wikimedia.org/r/698506 (https://phabricator.wikimedia.org/T284415) [13:27:40] (03PS1) 10Filippo Giunchedi: alertmanager: update dashboard minimum group width to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/698507 (https://phabricator.wikimedia.org/T284213) [13:27:45] !log volans@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [13:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:10] (03PS2) 10Jcrespo: mariadb: Temporarily reduce retention of dbprov2003-stored backups [puppet] - 10https://gerrit.wikimedia.org/r/698506 (https://phabricator.wikimedia.org/T284415) [13:28:40] !log volans@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 00m 54s) [13:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:52] (03CR) 10Herron: "made another 
pass this morning, a few questions/comments about cloud.yaml and a few minor nits. looks ready to me otherwise" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [13:30:55] (03PS1) 10Ssingh: acme_chief: authorize doh100* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698508 (https://phabricator.wikimedia.org/T284348) [13:31:10] !log volans@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next [13:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:43] (03CR) 10Herron: prometheus::pop add retention size param and set to 80G (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [13:31:56] (03CR) 10Herron: [C: 03+2] prometheus::pop add retention size param and set to 80G [puppet] - 10https://gerrit.wikimedia.org/r/698216 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [13:32:07] (03PS1) 10Muehlenhoff: Enable profile::nginx for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/698509 (https://phabricator.wikimedia.org/T164456) [13:32:25] !log volans@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (duration: 01m 14s) [13:32:26] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29807/console" [puppet] - 10https://gerrit.wikimedia.org/r/698508 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:09] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698509 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:34:30] !log installing libxml2 security updates on stretch [13:34:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:47] !log volans@deploy1002 Started deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (3) [13:34:50] (03CR) 10Herron: [C: 03+1] alertmanager: update dashboard minimum group width to 2048 [puppet] - 10https://gerrit.wikimedia.org/r/698507 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:04] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:35:27] (03PS3) 10Ema: Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) [13:35:39] !log volans@deploy1002 Finished deploy [netbox/deploy@c70df91]: Force deploy of gerrit/672831 to netbox-next (3) (duration: 00m 52s) [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:03] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10MoritzMuehlenhoff) [13:36:47] (03CR) 10jerkins-bot: [V: 04-1] Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [13:36:56] if anyone around has superpowers could you change clinic duty with myself in the topic please? [13:37:05] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10MoritzMuehlenhoff) I ticked off the "Reimage bastion" step from the task description since that happened a while ago with the Buster update. [13:37:54] volans would it make sense to grant ops to those who are regularly on clinic duty so that you can update it yourself? 
Just wondering [13:39:15] all SREs in the SRE teams are on clinic duty, it's a rotation ;) [13:39:17] (03PS4) 10Ema: Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) [13:39:24] (03CR) 10Herron: [C: 03+1] alertmanager: print link separators on IRC when needed [puppet] - 10https://gerrit.wikimedia.org/r/698459 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [13:39:42] (03CR) 10Ema: Netops team alert: ping offload (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [13:40:26] thanks marostegui! [13:40:29] (03CR) 10Ema: [C: 03+2] alertmanager: define IRC and page routes for sre team [puppet] - 10https://gerrit.wikimedia.org/r/698491 (https://phabricator.wikimedia.org/T273716) (owner: 10Ema) [13:40:47] yeah, so all SREs should maybe have the rights? [13:40:55] (03CR) 10Ayounsi: [C: 03+1] "Thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [13:41:43] I wouldn't mind, but not to me to decide either ;) [13:43:02] PROBLEM - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [13:43:31] (03CR) 10Vgutierrez: [C: 03+1] "looks good to me, maybe it's worth considering consolidating those 3 on something like doh[123]00[12]\.wikimedia\.org" [puppet] - 10https://gerrit.wikimedia.org/r/698508 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [13:43:36] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams} 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:43:43] Az1568/joe_oblivian/legoktm/ircservserv-wm has founder rights, so any of them should be able to do it [13:44:18] (03Abandoned) 10Effie Mouzeli: (WIP) Add tokens for maps-vector-server [labs/private] - 10https://gerrit.wikimedia.org/r/692865 (owner: 10Effie Mouzeli) [13:44:30] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [13:44:32] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [13:44:54] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [13:46:58] godog: can the prom restarts be because of the AM patch? ^ [13:47:32] no I don't think so, I'll take a look [13:47:48] i.e. 
the alertmanager configuration isn't in prometheus [13:48:04] could have been my patch, although that should only have bounced pop instance [13:48:08] instances* [13:48:26] herron: ah yeah that's it, I missed the merge [13:48:36] ema: ^ [13:48:44] ack, thanks! [13:50:19] (03PS1) 10Muehlenhoff: Switch acmechief-test1001 to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) [13:50:21] (03PS1) 10Muehlenhoff: Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) [13:51:37] (03CR) 10jerkins-bot: [V: 04-1] Switch acmechief-test1001 to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:51:40] (03CR) 10jerkins-bot: [V: 04-1] Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [13:55:36] (03CR) 10Filippo Giunchedi: "LGTM! 
See inline for a detail I forgot" (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [13:56:11] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/698508 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [13:56:20] (03CR) 10Ssingh: [V: 03+1 C: 03+2] acme_chief: authorize doh100* hosts for Wikidough [puppet] - 10https://gerrit.wikimedia.org/r/698508 (https://phabricator.wikimedia.org/T284348) (owner: 10Ssingh) [13:57:21] (03CR) 10Effie Mouzeli: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: 10Jbond) [13:57:24] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2021-06-04 13:35:54 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:57:25] (03CR) 10Effie Mouzeli: [C: 03+2] modules::memcached: add notls support for external addresses [puppet] - 10https://gerrit.wikimedia.org/r/695377 (https://phabricator.wikimedia.org/T271967) (owner: 10Jbond) [13:57:53] (03PS2) 10Muehlenhoff: Switch to nginx-light on all acmechief servers [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) [13:58:53] 10SRE, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests: Requesting access to contint-admins for Ladsgroup - https://phabricator.wikimedia.org/T283925 (10hashar) A bit late but `+1` since Amir has a tendency of being super helpful on everything he touches and I trust him to ask questions when h... 
[14:00:54] (03PS2) 10Muehlenhoff: Switch acmechief-test1001 to nginx-light [puppet] - 10https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) [14:01:00] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10hashar) [14:01:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698511 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:01:14] (03PS1) 10Ayounsi: Add OSPF link-protection to all P2P links [homer/public] - 10https://gerrit.wikimedia.org/r/698512 (https://phabricator.wikimedia.org/T167306) [14:02:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/698510 (https://phabricator.wikimedia.org/T164456) (owner: 10Muehlenhoff) [14:04:54] RECOVERY - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [14:05:44] (03CR) 10JMeybohm: [C: 04-1] mediawiki: add ca-bundle to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [14:09:30] 10SRE, 10serviceops: Refactor memcached modules - https://phabricator.wikimedia.org/T284454 (10jijiki) [14:09:42] 10SRE, 10serviceops, 10User-jijiki: Refactor memcached modules - https://phabricator.wikimedia.org/T284454 (10jijiki) [14:10:47] (03PS5) 10Ema: Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) [14:11:37] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ihurbain - https://phabricator.wikimedia.org/T284437 (10Volans) p:05Triage→03Medium @ihurbain welcome aboard! Have your @wikimedia.org email account username already been created? If so, which one is it? 
[14:12:17] (03CR) 10jerkins-bot: [V: 04-1] Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1113 (s5,s6) for upgrade', diff saved to https://phabricator.wikimedia.org/P16314 and previous config saved to /var/cache/conftool/dbconfig/20210607-141307-marostegui.json [14:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:14] (03PS1) 10Itamar Givon: Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) [14:13:47] (03PS1) 10Ottomata: drop_event job - Use dedicated Hive CLI log file [puppet] - 10https://gerrit.wikimedia.org/r/698519 (https://phabricator.wikimedia.org/T283126) [14:14:00] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [14:14:26] (03CR) 10Filippo Giunchedi: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:14:32] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [14:15:12] (03CR) 10jerkins-bot: [V: 04-1] Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [14:15:17] (03CR) 10jerkins-bot: [V: 04-1] drop_event job - Use dedicated Hive CLI log file [puppet] - 10https://gerrit.wikimedia.org/r/698519 (https://phabricator.wikimedia.org/T283126) (owner: 10Ottomata) [14:15:28] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [14:15:30] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [14:17:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1113 (s5,s6) after upgrade', diff saved to https://phabricator.wikimedia.org/P16315 and previous config saved to /var/cache/conftool/dbconfig/20210607-141722-marostegui.json [14:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:48] (03PS2) 10Ottomata: drop_event job - Use dedicated Hive CLI log file [puppet] - 10https://gerrit.wikimedia.org/r/698519 (https://phabricator.wikimedia.org/T283126) [14:17:55] (03PS1) 10Gerrit maintenance bot: Add dag to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/698521 (https://phabricator.wikimedia.org/T284450) [14:18:50] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/698521 (https://phabricator.wikimedia.org/T284450) (owner: 10Gerrit maintenance bot) [14:18:55] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29810/console" [puppet] - 10https://gerrit.wikimedia.org/r/698519 (https://phabricator.wikimedia.org/T283126) (owner: 10Ottomata) [14:20:16] (03CR) 10Ottomata: [V: 03+1 C: 03+2] drop_event job - Use dedicated Hive CLI log file [puppet] - 10https://gerrit.wikimedia.org/r/698519 (https://phabricator.wikimedia.org/T283126) (owner: 10Ottomata) [14:22:18] (03Abandoned) 10Urbanecm: Rename to MediaSearch & activate preferences & hooks [extensions/MediaSearch] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696042 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias Mullie) [14:22:31] (03Abandoned) 10Urbanecm: Rename to OldMediaSearch & remove duplicate preferences & hooks [extensions/WikibaseMediaInfo] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/696039 (https://phabricator.wikimedia.org/T265939) (owner: 10Matthias 
Mullie) [14:22:42] (03PS2) 10Itamar Givon: Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) [14:24:23] (03CR) 10jerkins-bot: [V: 04-1] Set Wikidata's main sandbox item [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [14:24:38] (03PS1) 10BBlack: wikimedia.com: facilitate NS changes [dns] - 10https://gerrit.wikimedia.org/r/698525 (https://phabricator.wikimedia.org/T281428) [14:25:34] (03CR) 10jerkins-bot: [V: 04-1] wikimedia.com: facilitate NS changes [dns] - 10https://gerrit.wikimedia.org/r/698525 (https://phabricator.wikimedia.org/T281428) (owner: 10BBlack) [14:26:55] (03CR) 10Filippo Giunchedi: [C: 03+1] "The jenkins failure is due to using a new feature (external_labels) in tests, the CI image uses buster while bullseye is needed. I'm looki" [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:27:30] (03PS1) 10DCausse: Add akhatun to analytics-search [puppet] - 10https://gerrit.wikimedia.org/r/698546 [14:28:37] (03CR) 10Ema: [V: 03+2 C: 03+2] Netops team alert: ping offload [alerts] - 10https://gerrit.wikimedia.org/r/697710 (https://phabricator.wikimedia.org/T282806) (owner: 10Ema) [14:28:49] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10Andrew) 05Open→03Resolved This host is now back in normal service. Thank you @Jclark-ctr ! [14:29:09] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ihurbain - https://phabricator.wikimedia.org/T284437 (10ihurbain) Hi @Volans :) Yes, I have an email account user name: ihurbainpalatin. 
[14:30:53] (03PS1) 10Urbanecm: initWikiConfig.php: Use same link ID for help panel links as community configuration would [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698527 (https://phabricator.wikimedia.org/T284072) [14:31:18] (03PS2) 10BBlack: wikimedia.com: facilitate NS changes [dns] - 10https://gerrit.wikimedia.org/r/698525 (https://phabricator.wikimedia.org/T281428) [14:33:53] (03PS4) 10Razzi: Add dbstore1007 to analytics firewall [homer/public] - 10https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) [14:34:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/696282 (https://phabricator.wikimedia.org/T283660) (owner: 10MMandere) [14:35:29] !log installing isc-dhcp security updates [14:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:28] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ihurbain - https://phabricator.wikimedia.org/T284437 (10ssastry) @Volans let me know if you need anything from me as @ihurbain's manager. 
[14:38:38] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:05] (03CR) 10Elukey: [C: 03+1] Add dbstore1007 to analytics firewall [homer/public] - 10https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [14:43:20] (03PS1) 10Filippo Giunchedi: pipeline: use bullseye to get newer prometheus [alerts] - 10https://gerrit.wikimedia.org/r/698548 (https://phabricator.wikimedia.org/T282806) [14:47:41] (03CR) 10Razzi: [C: 03+2] Add dbstore1007 to analytics firewall [homer/public] - 10https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [14:48:18] (03CR) 10Ema: [C: 03+1] pipeline: use bullseye to get newer prometheus [alerts] - 10https://gerrit.wikimedia.org/r/698548 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [14:48:21] (03Merged) 10jenkins-bot: Add dbstore1007 to analytics firewall [homer/public] - 10https://gerrit.wikimedia.org/r/697704 (https://phabricator.wikimedia.org/T283125) (owner: 10Razzi) [14:48:58] 10SRE, 10Platform Engineering, 10Release Pipeline, 10serviceops, and 5 others: Kask functional testing with Cassandra via the Deployment Pipeline - https://phabricator.wikimedia.org/T224041 (10hashar) [14:50:37] (03CR) 10DCausse: [C: 03+1] "Aisha will need to deploy a new job that refines sparql queries captured from wdqs public endpoints" [puppet] - 10https://gerrit.wikimedia.org/r/698546 (owner: 10DCausse) [14:50:41] (03PS1) 10Urbanecm: initWikiConfig.php: Use links to MW.org as fallbacks to Wikidata [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698549 (https://phabricator.wikimedia.org/T284072) [14:53:47] (03PS1) 10Razzi: kerberos: add krb: present for west1 [puppet] - 10https://gerrit.wikimedia.org/r/698550 (https://phabricator.wikimedia.org/T284022) 
[14:55:30] (03PS1) 10Urbanecm: skwiki: Make Growth features available in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698551 (https://phabricator.wikimedia.org/T284149) [14:56:01] (03CR) 10Razzi: [C: 03+2] kerberos: add krb: present for west1 [puppet] - 10https://gerrit.wikimedia.org/r/698550 (https://phabricator.wikimedia.org/T284022) (owner: 10Razzi) [14:56:44] (03CR) 10Filippo Giunchedi: [C: 03+2] pipeline: use bullseye to get newer prometheus [alerts] - 10https://gerrit.wikimedia.org/r/698548 (https://phabricator.wikimedia.org/T282806) (owner: 10Filippo Giunchedi) [14:57:22] !log installing remaining lz4 security updates on buster [14:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] (03PS1) 10Urbanecm: Set WelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698552 (https://phabricator.wikimedia.org/T281896) [14:58:20] (03PS3) 10Jcrespo: dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) [15:00:02] (03PS1) 10Urbanecm: enwiki: Deploy Growth freatures to 2% of new accounts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698555 (https://phabricator.wikimedia.org/T281896) [15:00:14] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for ihurbain - https://phabricator.wikimedia.org/T284437 (10Volans) 05Open→03Resolved a:03Volans @ssastry no need, thanks. @ihurbain: great, all done, `wmf` membership added to your user. Feel free to re-open this task in case you encounter any issue. [15:01:32] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) We have another bunch of #release-engineering-team images failing, probably all due to being jessie based: ` Jun 7 11:28:11 deneb docker-report-rel... 
[15:02:46] (03PS1) 10Ottomata: analytics cluster - Remove bigtop and stretch overrides where not needed [puppet] - 10https://gerrit.wikimedia.org/r/698556 (https://phabricator.wikimedia.org/T275786) [15:03:26] (03CR) 10Ebernhardson: [C: 03+1] Add akhatun to analytics-search [puppet] - 10https://gerrit.wikimedia.org/r/698546 (owner: 10DCausse) [15:07:01] 10SRE, 10LDAP-Access-Requests: Get access to a page using a Dev Account - https://phabricator.wikimedia.org/T284249 (10Volans) >>! In T284249#7134559, @LZaman wrote: > My manager is actually away at the moment, so can we keep this ticket open for a few weeks until she is back, and then I will get her approval?... [15:07:23] 10SRE, 10LDAP-Access-Requests: Access request to superset for user lzaman - https://phabricator.wikimedia.org/T284249 (10Volans) [15:12:08] 10SRE, 10Wikimedia-Mailing-lists, 10translatewiki.net, 10Patch-For-Review: Add mailman-templates to translatewiki.net - https://phabricator.wikimedia.org/T282022 (10Legoktm) >>! In T282022#7138044, @Nikerabbit wrote: >>>! In T282022#7137308, @Legoktm wrote: >> * Is `sr` expected to be in Latin or Cyrillic... [15:12:13] (03CR) 10Elukey: [C: 03+1] "LGTM! Just to be sure let's run a pcc but it seems an easy no-op :)" [puppet] - 10https://gerrit.wikimedia.org/r/698556 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [15:13:37] 10SRE, 10LDAP-Access-Requests: Access request to superset for user lzaman - https://phabricator.wikimedia.org/T284249 (10Volans) I've added you to the LDAP `wmf` group as that doesn't require additional approvals. @LZaman I was re-checking your original request and if you just need access to Superset, this sh... 
[15:14:00] (03PS3) 10BBlack: wikimedia.com: facilitate NS changes [dns] - 10https://gerrit.wikimedia.org/r/698525 (https://phabricator.wikimedia.org/T281428) [15:14:02] (03PS1) 10BBlack: wikimedia.com: reduce NS TTL to 1H [dns] - 10https://gerrit.wikimedia.org/r/698560 (https://phabricator.wikimedia.org/T281428) [15:15:09] (03CR) 10BBlack: [C: 03+2] wikimedia.com: reduce NS TTL to 1H [dns] - 10https://gerrit.wikimedia.org/r/698560 (https://phabricator.wikimedia.org/T281428) (owner: 10BBlack) [15:25:54] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10MoritzMuehlenhoff) Can we remove all jessie-related containers from the registry? [15:33:25] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10Jdforrester-WMF) >>! In T251918#7138799, @JMeybohm wrote: > We have another bunch of #release-engineering-team images failing, probably all due to being jessie... 
[15:41:26] PROBLEM - Bird Internet Routing Daemon on authdns1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:42:26] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:42:30] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [15:42:44] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:43:00] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:43:06] PROBLEM - Check if anycast-healthchecker and all configured threads are running on authdns1001 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [15:46:35] 10SRE, 10Release Pipeline, 10serviceops, 10Release-Engineering-Team (Radar), and 2 others: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm) [15:47:08] 10SRE, 10serviceops, 10Patch-For-Review, 10User-jbond: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10JMeybohm) >>! In T251918#7138881, @MoritzMuehlenhoff wrote: > Can we remove all jessie-related containers from the registry? Yeah, we "kind of" can: https://w... 
[15:51:22] (03CR) 10Lucas Werkmeister (WMDE): Set Wikidata's main sandbox item (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698518 (https://phabricator.wikimedia.org/T219215) (owner: 10Itamar Givon) [15:55:00] RECOVERY - snapshot of s7 in codfw on alert1001 is OK: Last snapshot for s7 at codfw (db2098.codfw.wmnet:3317) taken on 2021-06-07 13:39:23 (1080 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [16:07:48] (03CR) 10AKhatun: [C: 03+1] "Thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/698546 (owner: 10DCausse) [16:09:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@f236b95]: Bump glent jar to 0.2.6 [16:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:45] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@f236b95]: Bump glent jar to 0.2.6 (duration: 00m 35s) [16:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:12] (03CR) 10Giuseppe Lavagetto: mediawiki: add ca-bundle to chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) (owner: 10Giuseppe Lavagetto) [16:20:41] (03PS4) 10Giuseppe Lavagetto: mediawiki: add ca-bundle to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/698456 (https://phabricator.wikimedia.org/T284417) [16:23:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (10Papaul) 05Open→03Resolved @RKemper main board replaced on the server and firmware upgrade done as well. 
The server is back up on line [16:26:44] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) I spoke with the HP engineer last week, he said that it is true that CPU1 is bad but it might also be the pin on the main board so he will be s... [16:27:06] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@19313f7]: Bump glent jar to 0.2.6 [16:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:56] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10RobH) [16:28:16] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29812/console" [puppet] - 10https://gerrit.wikimedia.org/r/698556 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [16:28:47] 10SRE, 10Packaging, 10Release-Engineering-Team (Seen): Debian-glue doesn't check for the validity of the distribution in the changelog. - https://phabricator.wikimedia.org/T252619 (10hashar) The job uses dpkg-parsechangelog and sets `distribution` which is then used to set `DIST`. That is then passed to pbui... 
[16:28:47] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10RobH) [16:29:33] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10RobH) a:03Jclark-ctr [16:30:23] (03PS1) 10Herron: onboard apigw dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/698569 [16:31:35] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@19313f7]: Bump glent jar to 0.2.6 (duration: 04m 29s) [16:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:38] (03PS2) 10Herron: onboard apigw dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/698569 [16:36:14] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) Hey, Papaul, there is no rush (although it is a bit anoying from HP side), but do you think the initial ETA of today will be delayed? Please a... 
[16:36:40] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi sensor is in place [16:37:21] (03PS8) 10Majavah: Route Grid engine web requests via Kubernetes [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) [16:38:23] (03CR) 10Majavah: "> Patch Set 7:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [16:39:07] (03CR) 10Ottomata: [V: 03+1 C: 03+2] analytics cluster - Remove bigtop and stretch overrides where not needed [puppet] - 10https://gerrit.wikimedia.org/r/698556 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [16:39:39] (03PS9) 10Ottomata: mariadb::instance - allow passing extra configs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/697618 (https://phabricator.wikimedia.org/T272973) [16:39:58] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) @jcrespo i have no ETA for you for now since HP already send the case to the dispatch (third party UniSys) so someone should contact me. it is a... 
[16:41:23] 10SRE, 10Continuous-Integration-Infrastructure, 10observability, 10Patch-For-Review, 10Release-Engineering-Team (Seen): Export zuul metrics to Prometheus - https://phabricator.wikimedia.org/T233089 (10hashar) [16:42:02] 10SRE, 10SRE-tools, 10Release-Engineering-Team (Seen): Support running puppet Beaker on CI - https://phabricator.wikimedia.org/T253635 (10hashar) [16:42:27] (03CR) 10Ottomata: [C: 03+2] mariadb::instance - allow passing extra configs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/697618 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [16:43:09] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) @Papaul, thanks, that is already useful info for handling the db status, and all I needed! Please contact @Marostegui or @kormat if there are n... [16:43:54] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10Papaul) okay [16:51:52] !log run homer '*.eqiad.wmnet' diff [16:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:57] Did that an hour ago, forgot to log [16:55:00] Is it known that Mediawiki is spewing poolcounter queue full warnings about CirrusSearch? `Pool error on CirrusSearch-Search:_elasticsearch: pool-queuefull` [16:55:48] shdubsh: not known, looking now [16:56:02] started about 9:15 UTC afaict [16:56:13] !log [WDQS] `ryankemper@wdqs1005:~$ sudo systemctl restart wdqs-blazegraph` (blazegraph locked up) [16:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:18] ryankemper: thanks :) [16:56:29] shdubsh: thanks for the heads! [17:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T1700). 
[17:00:43] Yeah we can see the spike in pool counter rejections here: https://grafana.wikimedia.org/d/qrOStmdGk/elasticsearch-pool-counters?orgId=1&from=1622945262851&to=1623085183053 (lines up with that 9:15 UTC time given above) [17:01:57] (03PS1) 10Ottomata: analytics cluster - Remove more deb packages that sbould not be needed [puppet] - 10https://gerrit.wikimedia.org/r/698575 (https://phabricator.wikimedia.org/T275786) [17:02:03] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:39] Corresponding spike in https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=1622945262851&to=1623085183053 - looks like it's the `fulltext` operations that are eating up the capacity [17:05:47] (03CR) 10Ottomata: [C: 03+2] "Will run" [puppet] - 10https://gerrit.wikimedia.org/r/698575 (https://phabricator.wikimedia.org/T275786) (owner: 10Ottomata) [17:11:57] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:12:05] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:13] RECOVERY - Check if anycast-healthchecker and all configured threads are running on authdns1001 is OK: OK: UP (pid=24317) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:12:19] RECOVERY - Bird Internet Routing Daemon on authdns1001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:12:45] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success 
https://wikitech.wikimedia.org/wiki/DNS [17:12:53] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:13:24] (03PS1) 10Phuedx: universalLanguageSelector: Add missing properties [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) [17:13:49] (03PS1) 10Phuedx: Pass context to compact_language_links.open hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.8) - 10https://gerrit.wikimedia.org/r/698536 (https://phabricator.wikimedia.org/T280770) [17:13:56] (03CR) 10jerkins-bot: [V: 04-1] universalLanguageSelector: Add missing properties [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [17:14:56] (03CR) 10Phuedx: "Recheck." [extensions/WikimediaEvents] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698535 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [17:15:22] 10SRE, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10wiki_willy) Hi @Dzahn - do you have an ETA on when we can start removing these from the racks? We have a few installs that are partially compl... 
[17:15:29] (03Abandoned) 10Phuedx: Pass context to compact_language_links.open hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.8) - 10https://gerrit.wikimedia.org/r/698536 (https://phabricator.wikimedia.org/T280770) (owner: 10Phuedx) [17:15:43] (03PS1) 10Phuedx: Pass context to compact_language_links.open hook [extensions/UniversalLanguageSelector] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698537 (https://phabricator.wikimedia.org/T280770) [17:15:47] 10SRE, 10serviceops, 10User-jbond, 10User-jijiki: Refactor memcached modules - https://phabricator.wikimedia.org/T284454 (10jbond) [17:16:03] Feels like some sort of bot is hammering us really hard, thus the spike in QPS and the corresponding increase in poolcounter rejections [17:17:05] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:17:36] (03CR) 10Razzi: [C: 03+1] hadoop: increase the HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/698194 (https://phabricator.wikimedia.org/T283733) (owner: 10Elukey) [17:17:54] !log [Cirrussearch] We're seeing ~10% of current requests being rejected by poolcounter, due to ~2x expected `eqiad.full_text` query volume and ~30x expected `eqiad.entity_full_text` query volume [17:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:46] ACKNOWLEDGEMENT - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] Ryan Kemper https://phabricator.wikimedia.org/T284479 https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:25:58] Made https://phabricator.wikimedia.org/T284479 to track this. Current impact is ~10% of all search requests will be rejected, forcing the user to need to re-submit the request to get another chance at their request being processed [17:26:51] (03PS1) 10Majavah: Fix prometheus monitoring for Toolforge Ingress [puppet] - 10https://gerrit.wikimedia.org/r/698578 (https://phabricator.wikimedia.org/T284353) [17:28:58] (03CR) 10Klausman: [C: 03+1] "> Patch Set 7:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/693826 (https://phabricator.wikimedia.org/T280661) (owner: 10Elukey) [17:30:25] (03PS3) 10Jcrespo: mariadb: Temporarily reduce retention of dbprov2003-stored backups [puppet] - 10https://gerrit.wikimedia.org/r/698506 (https://phabricator.wikimedia.org/T284415) [17:30:40] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [17:31:06] (03CR) 10Jcrespo: "Going plan B, as plan A is not likely happening today. This should avoid ongoing issues until hw is back into service." [puppet] - 10https://gerrit.wikimedia.org/r/698506 (https://phabricator.wikimedia.org/T284415) (owner: 10Jcrespo) [17:32:35] (03CR) 10Jcrespo: [C: 03+2] mariadb: Temporarily reduce retention of dbprov2003-stored backups [puppet] - 10https://gerrit.wikimedia.org/r/698506 (https://phabricator.wikimedia.org/T284415) (owner: 10Jcrespo) [17:34:12] 10SRE, 10Services, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Shellbox - https://phabricator.wikimedia.org/T281423 (10Joe) So I found the problem: in https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/673136 we removed the run stanza from builder, thus making the require... [17:34:18] (03CR) 10CDanis: [C: 03+1] "LGTM! I agree that SameSite Lax is the most sensible." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [17:35:46] (03PS1) 10TrainBranchBot: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/698582 [17:35:48] (03CR) 10TrainBranchBot: [C: 03+2] Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/698582 (owner: 10TrainBranchBot) [17:36:24] 10ops-codfw, 10DBA, 10Data-Persistence-Backup: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (10jcrespo) @Papaul I've shutdown this host and downtime'd it until the 16th (when I am back) so it can be serviced at anytime without requiring coordinati... [17:36:50] (03PS4) 10Jcrespo: dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/693142 (https://phabricator.wikimedia.org/T283235) [17:36:52] (03Merged) 10jenkins-bot: Update train-versions.json [mediawiki-config] (sandbox/dancy) - 10https://gerrit.wikimedia.org/r/698582 (owner: 10TrainBranchBot) [17:37:05] (03PS1) 10Jbond: docker-reporter: filter out old removed images [puppet] - 10https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) [17:37:43] (03PS6) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [17:38:18] (03PS7) 10Jbond: P:idp::client::http:site: add support for same site cookie [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) [17:38:42] (03CR) 10Jbond: "thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/697730 (https://phabricator.wikimedia.org/T264605) (owner: 10Jbond) [17:39:55] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Switchover codfw s5 backups from db2099 to db2101 (buster) [puppet] - 10https://gerrit.wikimedia.org/r/693142 
(https://phabricator.wikimedia.org/T283235) (owner: 10Jcrespo) [17:40:07] Quick update on the cirrussearch poolcounter rejections, we're working on figuring out the source IP of the actor that's slamming us with these requests. Hadoop has a table `event.mediawiki_cirrussearch_request` that has request info so that's where we're poking around currently [17:41:51] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-06-07 16:31:01 (575 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [17:45:30] (03PS2) 10Jbond: docker-reporter: filter out old removed images [puppet] - 10https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) [17:45:54] 10SRE, 10Traffic, 10vm-requests, 10Patch-For-Review: Please create two Ganeti VMs for Wikidough in eqiad - https://phabricator.wikimedia.org/T284348 (10Volans) p:05Triage→03Medium [17:46:33] (03PS3) 10Jbond: docker-reporter: filter out old removed images [puppet] - 10https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) [17:47:16] (03PS1) 10Ottomata: Update docs for kafka/roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/698586 [17:49:20] (03CR) 10Elukey: Update docs for kafka/roll-restart-brokers.py (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/698586 (owner: 10Ottomata) [17:49:33] jouncebot: next [17:49:33] In 0 hour(s) and 10 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T1800) [17:50:30] (03PS1) 10Majavah: toolforge: Remove non-helm ingress-nginx files [puppet] - 10https://gerrit.wikimedia.org/r/698588 (https://phabricator.wikimedia.org/T264221) [17:50:34] (03PS2) 10Ottomata: Update docs for kafka/roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/698586 [17:50:43] (03CR) 10Ottomata: Update docs for kafka/roll-restart-brokers.py (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/698586 (owner: 10Ottomata) [17:51:41] (03CR) 10Elukey: [C: 03+1] "Thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/698586 (owner: 10Ottomata) [17:53:53] !log rolling restart of kafka jumbo mirror makers - T283067 [17:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:58] !log otto@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker [17:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:10] (03CR) 10Urbanecm: [C: 03+2] Make it possible to deploy welcomesurvey to % of users that's not divisible by 10 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698364 (https://phabricator.wikimedia.org/T284127) (owner: 10Urbanecm) [17:55:12] (03CR) 10Urbanecm: [C: 03+2] Align welcome survey group with homepage group [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698365 (https://phabricator.wikimedia.org/T284257) (owner: 10Urbanecm) [17:55:14] (03CR) 10Urbanecm: [C: 03+2] initWikiConfig.php: Use same link ID for help panel links as community configuration would [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698527 (https://phabricator.wikimedia.org/T284072) (owner: 10Urbanecm) [17:55:17] (03CR) 10Urbanecm: [C: 03+2] initWikiConfig.php: Use links to MW.org as fallbacks to Wikidata [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698549 (https://phabricator.wikimedia.org/T284072) (owner: 10Urbanecm) [17:58:24] (03CR) 10Ottomata: [C: 03+2] Update docs for kafka/roll-restart-brokers.py [cookbooks] - 10https://gerrit.wikimedia.org/r/698586 (owner: 10Ottomata) [18:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T1800). 
[18:00:05] Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] i'll self-serve [18:01:16] (03CR) 10Urbanecm: [C: 03+2] Set WelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698552 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [18:02:03] (03Merged) 10jenkins-bot: Set WelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698552 (https://phabricator.wikimedia.org/T281896) (owner: 10Urbanecm) [18:04:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5de2f8b27b016a2cd8f424d8e40318edde5e5704: Set WelcomeSurveyEnableWithHomepage (T281896, T284257) (duration: 00m 59s) [18:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:15] T281896: Deploy Growth features on English Wikipedia - https://phabricator.wikimedia.org/T281896 [18:04:15] T284257: Align welcome survey treatment group with homepage treatment group - https://phabricator.wikimedia.org/T284257 [18:04:19] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=skwiki growthexperiments # T284149 [18:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:25] T284149: Deploy Growth features on Slovak Wikipedia - https://phabricator.wikimedia.org/T284149 [18:05:25] (03PS2) 10Urbanecm: skwiki: Make Growth features available in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698551 (https://phabricator.wikimedia.org/T284149) [18:05:28] (03CR) 10Urbanecm: [C: 03+2] skwiki: Make Growth features available in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698551 (https://phabricator.wikimedia.org/T284149) (owner: 10Urbanecm) [18:06:44] (03Merged) 10jenkins-bot: skwiki: Make Growth features available in dark mode 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/698551 (https://phabricator.wikimedia.org/T284149) (owner: 10Urbanecm) [18:09:20] (03PS17) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [18:10:47] (03PS18) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [18:12:12] !log otto@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [18:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:55] (03PS19) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [18:13:17] (03CR) 10Jbond: "updated thanks" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [18:13:29] !log urbanecm@deploy1002 Synchronized wmf-config/config/skwiki.yaml: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 1/3) (duration: 00m 59s) [18:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:33] T284149: Deploy Growth features on Slovak Wikipedia - https://phabricator.wikimedia.org/T284149 [18:14:23] !log rolling restart of kafka jumbo brokers - T283067 [18:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:26] !log otto@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [18:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:49] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 2/3) (duration: 00m 56s) [18:14:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:21] 10SRE, 10Traffic, 10netops: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) Some notes from last week's IRC chat: * It would be useful to count the ICMP PTB messages we received (regardless of if they arrive on the correct server or not) so we know ** H... [18:16:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 3/3) (duration: 00m 56s) [18:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:16] (03Merged) 10jenkins-bot: Make it possible to deploy welcomesurvey to % of users that's not divisible by 10 [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698364 (https://phabricator.wikimedia.org/T284127) (owner: 10Urbanecm) [18:17:19] (03Merged) 10jenkins-bot: Align welcome survey group with homepage group [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698365 (https://phabricator.wikimedia.org/T284257) (owner: 10Urbanecm) [18:17:21] (03Merged) 10jenkins-bot: initWikiConfig.php: Use same link ID for help panel links as community configuration would [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698527 (https://phabricator.wikimedia.org/T284072) (owner: 10Urbanecm) [18:17:24] (03Merged) 10jenkins-bot: initWikiConfig.php: Use links to MW.org as fallbacks to Wikidata [extensions/GrowthExperiments] (wmf/1.37.0-wmf.7) - 10https://gerrit.wikimedia.org/r/698549 (https://phabricator.wikimedia.org/T284072) (owner: 10Urbanecm) [18:18:32] (03CR) 10Bstorm: toolforge: Remove non-helm ingress-nginx files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/698588 (https://phabricator.wikimedia.org/T264221) (owner: 10Majavah) [18:20:18] !log urbanecm@deploy1002 Synchronized 
php-1.37.0-wmf.7/extensions/GrowthExperiments/maintenance/initWikiConfig.php: 7089728: b2482fb: initWikiConfig GE backports (T284072) (duration: 00m 58s) [18:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] T284072: Streamline the deployment process of Growth features, by automatically prefill the Community configuration page - https://phabricator.wikimedia.org/T284072 [18:22:41] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/extension.json: 368b5d9: 0e79aee: WelcomeSurvey backports (T284127, T284257; 1/2) (duration: 00m 56s) [18:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:48] T284257: Align welcome survey treatment group with homepage treatment group - https://phabricator.wikimedia.org/T284257 [18:22:49] T284127: Make it possible to deploy welcomesurvey to % of users that's not divisible by 10 - https://phabricator.wikimedia.org/T284127 [18:24:18] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/includes/WelcomeSurvey.php: 368b5d9: 0e79aee: WelcomeSurvey backports (T284127, T284257; 2/2) (duration: 00m 57s) [18:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:59] !log [urbanecm@mwmaint1002 /srv/mediawiki/php-1.37.0-wmf.7]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=skwiki --phab=T284149 [18:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:04] T284149: Deploy Growth features on Slovak Wikipedia - https://phabricator.wikimedia.org/T284149 [18:26:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 77 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:34:59] hey Amir1 you should have access to those slides 
now [18:35:15] too bad there isn't a button I can click to "give to all WMF/WMDE", it's irritating [18:36:26] apergos: Thanks! [18:36:36] yeah, sorry about that >_< [18:36:48] the second half of those slides is the same slides plus the speaker notes [18:36:58] just fyi anyone who's looking at 'em [18:37:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 40 probes of 624 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:38:15] (03CR) 10Bstorm: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/29813/tools-prometheus-03.tools.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/698578 (https://phabricator.wikimedia.org/T284353) (owner: 10Majavah) [18:40:14] (03CR) 10Bstorm: [C: 03+2] galera: clear up confusing xtrabackup parameter [puppet] - 10https://gerrit.wikimedia.org/r/698251 (https://phabricator.wikimedia.org/T284157) (owner: 10Bstorm) [18:43:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:48:57] (03PS1) 10Ahmon Dancy: Test commit. Disregard [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/698601 [18:49:53] (03CR) 10Ahmon Dancy: [C: 03+2] Test commit. 
Disregard [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/698601 (owner: 10Ahmon Dancy) [18:57:34] !log prometheus3001: moved /srv back to vda1 filesystem T243057 [18:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:38] T243057: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 [18:59:09] !log andrew@deploy1002 Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve [18:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:10] !log andrew@deploy1002 Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve (duration: 02m 01s) [19:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:23] !log andrew@deploy1002 Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 [19:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:27] T284462: Horizon should get confirmation for shelve operations - https://phabricator.wikimedia.org/T284462 [19:03:25] PROBLEM - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [19:04:19] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [19:07:16] !log andrew@deploy1002 Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 (duration: 04m 53s) [19:07:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:47] (03CR) 10jerkins-bot: [V: 04-1] Test commit. Disregard [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/698601 (owner: 10Ahmon Dancy) [19:10:57] (03PS1) 10CDanis: temp limit GAE/GCE traffic towards search API [puppet] - 10https://gerrit.wikimedia.org/r/698607 (https://phabricator.wikimedia.org/T284479) [19:12:32] 10SRE, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10Eevans) p:05Medium→03Low Did we decide against this? Is this issue still valid? [19:13:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:13:59] (03CR) 10Ayounsi: [C: 03+1] temp limit GAE/GCE traffic towards search API [puppet] - 10https://gerrit.wikimedia.org/r/698607 (https://phabricator.wikimedia.org/T284479) (owner: 10CDanis) [19:15:14] (03CR) 10BBlack: [C: 03+1] temp limit GAE/GCE traffic towards search API [puppet] - 10https://gerrit.wikimedia.org/r/698607 (https://phabricator.wikimedia.org/T284479) (owner: 10CDanis) [19:16:05] (03CR) 10CDanis: [C: 03+2] temp limit GAE/GCE traffic towards search API [puppet] - 10https://gerrit.wikimedia.org/r/698607 (https://phabricator.wikimedia.org/T284479) (owner: 10CDanis) [19:19:53] !log T284479 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin -b16 'A:cp-text' "run-puppet-agent" [19:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:58] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [19:21:19] !log T284479 [Cirrussearch] We're working on rolling 
out https://gerrit.wikimedia.org/r/698607, which will ban search API requests that match the Google App Engine IP range `2600:1900::0/28` AND whose user agent includes `HeadlessChrome` [19:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:37] cdanis: can you remind me how you make the clock stuff work? [19:21:55] (03CR) 10Herron: [C: 03+1] "lgtm thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [19:22:47] RhinosF1: https://github.com/cdanis/dotfiles/blob/master/zsh/.zshfunc/prompt_cdanis1_setup [19:22:48] RhinosF1: https://github.com/wikimedia/puppet/blob/production/modules/admin/files/home/cdanis/.zshfunc/prompt_cdanis1_setup [19:22:49] :) [19:22:54] (03PS10) 10Herron: mx2001: disable ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/612826 (owner: 10Jbond) [19:24:27] Ty [19:24:30] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Volans) Anything still pending here on the #sre-access-requests side? [19:25:26] (03CR) 10Herron: [V: 03+2 C: 03+2] onboard apigw dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/698569 (owner: 10Herron) [19:25:40] !log T284479 [Cirrussearch] Seeing the expected drop in `entity_full_text` requests here: https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-12h&to=now As a result we're no longer rejecting any requests [19:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:46] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [19:25:49] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [19:26:45] RECOVERY - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [19:29:42] 10SRE, 10Analytics, 10Analytics-Kanban, 10Product-Analytics, and 2 others: Requesting access to analytics-privatedata-users for schoenbaechler - https://phabricator.wikimedia.org/T283190 (10Volans) Adding @lucyblackwell for approval. [19:30:56] !log T284479 [Cirrussearch] We'll keep monitoring. For now this incident is resolved. Glancing at our current volume relative to what we'd expect, the numbers we see match what we'd expect. If we're accidentally banning any innocent requests they must be an incredibly small percentage of the total otherwise we'd see significantly lower volume than expected [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:01] T284479: Cirrussearch: spike in pool counter rejections related to full_text and entity_full_text queries - https://phabricator.wikimedia.org/T284479 [19:32:09] (03CR) 10Jforrester: "In future, when we decommission old images should we write a patch adding them to this regex?" 
[puppet] - 10https://gerrit.wikimedia.org/r/698583 (https://phabricator.wikimedia.org/T251918) (owner: 10Jbond) [19:32:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:33:37] 10SRE, 10Cassandra, 10Patch-For-Review: Move cassandra puppet code (used by Restbase, Sessionstore, AQS, maps) to profile::java - https://phabricator.wikimedia.org/T261966 (10Eevans) p:05Medium→03Low [19:34:32] 10SRE, 10Patch-For-Review, 10Platform Engineering (Icebox), 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (10Eevans) [19:35:55] 10SRE, 10Cassandra, 10User-jbond: Create a cassandra.service which subsumes casandra-{a,b,c} services using PartsOf=cassandra.service - https://phabricator.wikimedia.org/T229916 (10Eevans) p:05Medium→03Low [19:39:21] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:41:53] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 15.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:44:39] 10Puppet, 10Analytics-Radar, 10observability, 10Services (watching), 10User-Elukey: Upgrade prometheus-jmx-exporter on all services using it - https://phabricator.wikimedia.org/T192948 (10Eevans) [19:45:05] 10SRE, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10herron) The 150G secondary disk has been removed from the prometheus3001 VM. 
Strangely after gnt-instance shutdown/start prometheus3001 its network interface was renamed.... [19:45:31] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 95.74 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:46:35] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:46:57] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:46:59] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [19:47:01] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [19:47:31] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site=esams 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [19:48:32] 10SRE, 10Goal, 10User-Eevans: Handle HBA controllers in get-raid-status-hpssacli - https://phabricator.wikimedia.org/T185216 (10Eevans) [19:51:57] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T2000). [20:03:11] 10SRE, 10Cassandra, 10Services (watching), 10User-Eevans: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590 (10Eevans) 05Stalled→03Declined Boldly closing //declined//, please reopen (at a commiserate priority) if this is something we... [20:10:21] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:10:25] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:10:25] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [20:10:55] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [20:23:47] (03PS1) 10Ahmon Dancy: Test commit. Disregard [core] (wmf/1.37.0-wmf.6) - 10https://gerrit.wikimedia.org/r/698620 [20:24:25] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) [20:27:17] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10ssingh) >>! In T281344#7139967, @Volans wrote: > Anything still pending here on the #sre-access-requests side? Thanks for checking! I updated pwstore as that was completed; I am not sure about the rema... [20:40:12] (03PS1) 10Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) [20:40:51] (03CR) 10Bartosz Dziewoński: [C: 04-1] "Do not merge until T282699 is resolved" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) (owner: 10Bartosz Dziewoński) [20:41:25] (03CR) 10jerkins-bot: [V: 04-1] Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280) (owner: 10Bartosz Dziewoński) [20:46:38] (03CR) 10Andrew Bogott: [C: 03+1] "I would like to merge this before the actual migration event so that we can double-check that connectivity works. 
Does anyone object to m" [puppet] - https://gerrit.wikimedia.org/r/689092 (https://phabricator.wikimedia.org/T282209) (owner: Andrew Bogott)
[20:49:07] RECOVERY - MariaDB Replica Lag: pc1 on pc2007 is OK: OK slave_sql_lag Replication lag: 59.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:53:10] (PS2) Bartosz Dziewoński: Enable DiscussionTools' topicsubscription as beta feature on partner wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/698622 (https://phabricator.wikimedia.org/T274280)
[20:57:53] (CR) Ahmon Dancy: [C: +2] Test commit. Disregard [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/698620 (owner: Ahmon Dancy)
[21:00:04] Reedy and sbassett: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T2100).
[21:06:05] Hey all - deploying a security patch for T284364 now.
[21:12:32] !log Deployed security patch for T284364
[21:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:24] (Merged) jenkins-bot: Test commit.
Disregard [core] (wmf/1.37.0-wmf.6) - https://gerrit.wikimedia.org/r/698620 (owner: Ahmon Dancy)
[21:26:01] (PS2) Urbanecm: lvwiki: Enable Growth features in dark mode [mediawiki-config] - https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191)
[21:26:22] !log otto@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)
[21:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:47] (PS3) Urbanecm: lvwiki: Enable Growth features in dark mode [mediawiki-config] - https://gerrit.wikimedia.org/r/697928 (https://phabricator.wikimedia.org/T278191)
[21:41:35] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:50:10] SRE, LDAP-Access-Requests: Grant Access to nda for west1 - https://phabricator.wikimedia.org/T284136 (leila) >>! In T284136#7133515, @elukey wrote: >>>! In T284136#7133186, @colewhite wrote: >> @KFrancis can you confirm an NDA on file for @Cervisiarius? > [...] > * the point of contact in puppet (Aaron i...
[22:23:13] RECOVERY - snapshot of s3 in codfw on alert1001 is OK: Last snapshot for s3 at codfw (db2139.codfw.wmnet:3313) taken on 2021-06-07 19:20:34 (1055 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:37:47] (CR) Cwhite: [C: +1] alertmanager: print link separators on IRC when needed [puppet] - https://gerrit.wikimedia.org/r/698459 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[22:37:57] (CR) Cwhite: [C: +1] alertmanager: update dashboard minimum group width to 2048 [puppet] - https://gerrit.wikimedia.org/r/698507 (https://phabricator.wikimedia.org/T284213) (owner: Filippo Giunchedi)
[22:54:01] RECOVERY - snapshot of x1 in codfw on alert1001 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2021-06-07 22:30:26 (285 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210607T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:06:45] SRE, LDAP-Access-Requests: Add Dat Nguyen to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284285 (RStallman-legalteam) Thank you! I have sent the NDA for signatures via Docusign.
[23:07:18] SRE, LDAP-Access-Requests: Add Kara Payne to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T284308 (RStallman-legalteam) Thank you! I have sent the NDA for signatures via Docusign.
[23:07:43] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:07] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:17:31] (PS1) Bstorm: dumps distribution: uncomment sagres.c3sl.ufpr.br [puppet] - https://gerrit.wikimedia.org/r/698636
[23:20:33] SRE, Technical-blog-posts, Wikimedia-Mailing-lists: Story idea for Blog: Discovering and fixing CVE-2021-33038 in Mailman3 - https://phabricator.wikimedia.org/T284486 (Ladsgroup) We should write a blog post about the upgrade in general too. Maybe later.
[23:20:39] (CR) Bstorm: "I think this is good to move forward based on emails. I tried running:" [puppet] - https://gerrit.wikimedia.org/r/698636 (owner: Bstorm)
[23:22:58] (CR) Bstorm: "this would probably be good. Seems the commit message has a typo? Isn't it profile::nginx?" [puppet] - https://gerrit.wikimedia.org/r/698485 (owner: Muehlenhoff)
[23:26:28] (CR) Bstorm: [C: +1] "I'm not very fussed about the commit msg personally 😊" [puppet] - https://gerrit.wikimedia.org/r/698485 (owner: Muehlenhoff)
[23:40:17] (CR) Ladsgroup: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/697637 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)