[00:06:32] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:15:30] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:39:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:49:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:49:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.350 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:56:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 47967 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:17:46] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:39:57] SRE, Privacy Engineering, Research, Security-Team, and 3 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (bmansurov) @leila, Remaining items: 1. Implement https://wikitech.wikimedia.org/wiki/Tool:Event_Metrics...
[03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:55:10] PROBLEM - MariaDB Replica Lag: s2 on clouddb1014 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43233.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:55:10] PROBLEM - MariaDB Replica Lag: s2 on clouddb1018 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43234.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:02:26] PROBLEM - MariaDB Replica Lag: s2 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 43670.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:02:52] PROBLEM - MariaDB Replica SQL: s2 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1091, Errmsg: Error Can't DROP INDEX new_name_timestamp: check that it exists on query. Default database: bgwiktionary. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:13:54] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:08] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:30:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki-history-drop-snapshot.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:42:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220430T0700)
[07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:18:37] Amir1: db1155 broke replication
[07:18:59] I cannot check as I'm not next to my computer, I will be back in like 3-4h
[07:20:04] looks like it's related to your schema change
[07:20:20] it is not user impacting (it impacts the wiki replicas, as they'll only have some delay)
[07:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:32:22] marostegui: I will get to it asap
[07:32:53] I created a task for it
[07:38:14] PROBLEM - MariaDB Replica Lag: s2 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 53018.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:45:28] RECOVERY - MariaDB Replica SQL: s2 on db1155 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:48:26] Amir1: with replication? check the clouddb hosts
[07:48:54] marostegui: yup, they are also catching up
[07:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:40:12] RECOVERY - MariaDB Replica Lag: s2 on clouddb1018 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:40:12] RECOVERY - MariaDB Replica Lag: s2 on clouddb1014 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:40:30] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[08:40:42] RECOVERY - MariaDB Replica Lag: s2 on db1155 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:43:58] SRE: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (Volans) @fgiunchedi yes and no, duplicates within the operations/dns repository are currently caught, but duplication within the automatically-generated data or between the manual and the generated...
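Aside on the replica-lag and replica-SQL alerts above: they come from a monitoring check that reads the replica's status and compares it against thresholds. As a rough illustration only (not Wikimedia's actual Icinga plugin), a minimal Python sketch of such a check using pymysql might look like this; the user, password, thresholds and the host in the example invocation are assumptions:

#!/usr/bin/env python3
# Minimal sketch of a MariaDB replica check; credentials and thresholds are illustrative.
import pymysql

WARN_LAG = 300    # seconds, assumed warning threshold
CRIT_LAG = 1800   # seconds, assumed critical threshold

def check_replica(host: str) -> str:
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()

    if status is None:
        return "UNKNOWN - host is not configured as a replica"

    # A schema change that fails on the replica (e.g. Errno 1091,
    # "Can't DROP INDEX ...") stops the SQL thread, as seen on db1155 above.
    if status["Slave_SQL_Running"] != "Yes":
        return ("CRITICAL slave_sql_state Slave_SQL_Running: No, "
                f"Errno: {status['Last_SQL_Errno']}, Errmsg: {status['Last_SQL_Error']}")

    lag = status["Seconds_Behind_Master"]
    if lag is None or lag >= CRIT_LAG:
        return f"CRITICAL slave_sql_lag Replication lag: {lag} seconds"
    if lag >= WARN_LAG:
        return f"WARNING slave_sql_lag Replication lag: {lag} seconds"
    return f"OK slave_sql_lag Replication lag: {lag} seconds"

if __name__ == "__main__":
    print(check_replica("db1155.eqiad.wmnet"))  # hypothetical invocation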
[09:50:48] SRE-tools, DC-Ops, Infrastructure-Foundations: sre.hosts.reimage: wait reboot time timeout on aqs nodes - https://phabricator.wikimedia.org/T307260 (Volans)
[09:58:11] (PS3) Majavah: P:toolforge::prometheus: simplify prometheus config [puppet] - https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716)
[09:59:58] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35012/console" [puppet] - https://gerrit.wikimedia.org/r/779474 (https://phabricator.wikimedia.org/T304716) (owner: Majavah)
[10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:12:04] SRE-swift-storage, Data-Engineering, Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (Peachey88)
[11:28:46] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:04:46] (PS1) KartikMistry: Enable SectionTranslation in testwiki for af, as, gu, kn, mk and se [mediawiki-config] - https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828)
[12:07:30] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:03:10] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:16] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-ebernhardson-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:10] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-ebernhardson-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:07] Is something currently wrong with OAuth, or is something broken in my app?
[14:26:10] An error occurred connecting to your account:
[14:26:10] Error retrieving token: mwoauthdatastore-request-token-not-found: Sorry, something went wrong connecting this application. Go back and try to connect your account again, or contact the application author. OAuth token not found, E004
[14:26:10] Click here to try to login again!!!
[14:26:23] I keep getting ^ when trying to log in.
[14:27:04] It just started happening now as far as I can tell. But nothing in the codebase or the consumers has changed.
[14:29:06] SRE-swift-storage, Data-Engineering, Data-Persistence, Privacy Engineering: Swift for differential privacy data publication - https://phabricator.wikimedia.org/T307245 (RhinosF1)
[14:29:44] Cyberpower678: nothing changes on a Saturday
[14:30:02] Then what could be causing this weird error?
[14:30:17] Because my OAuth code hasn't changed in years, and neither have the consumers
[14:30:22] Have you tried restarting it and logging in again?
[14:30:41] Could you check the consumers are still active?
[14:33:13] RhinosF1: restarting the webservice?
[14:34:11] Cyberpower678: if that's what's handling OAuth
[14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:51:25] RhinosF1: webservice restart did not help
[14:52:07] Cyberpower678: and the tokens are definitely still active?
[14:52:13] Yes
[14:57:14] Oh wait, I think there might be some DB weirdness going on. I just upgraded the VM for the DB, and apps are still connecting to the legacy domains.
[15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:19:10] RhinosF1: nope. Not a DB issue.
[15:20:30] Cyberpower678: could something have copied across wrong?
[15:21:22] Not likely. I'm actually getting taken to the consumer authorization part of OAuth. The failure happens when OAuth receives the verification tokens from MediaWiki
[15:21:59] Which means the request it's generating to send the user to the auth page is a valid one. Implying the OAuth keys are set properly.
[15:22:49] (Though I wonder if there might be an issue with writing to sessions. The session settings are different, and may still be mapped to the legacy domains.)
[15:23:29] IABot has a crap ton of config variables.
[15:29:16] RhinosF1: there we go, I forgot to update the configuration that governs session data, which also goes to the DB.
[15:29:49] So the session was getting lost the moment the user left the page
[15:33:02] Cyberpower678: ah!
[15:33:24] Yep. Details details details
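For context on the failure mode discussed above: an OAuth 1.0a handshake with MediaWiki issues a request token that the application must persist (in IABot's case, in DB-backed sessions) between redirecting the user to the authorization page and completing the handshake on the callback. If the session store points at the wrong backend, the token written in step 1 is never found in step 2 and the login fails with a token-not-found error. IABot itself is PHP; the following is only a minimal Python sketch of the same flow using the mwoauth library, with a hypothetical consumer key/secret and an in-memory dict standing in for the session store:

# Illustrative sketch only, not IABot's real code or config.
from mwoauth import ConsumerToken, Handshaker

consumer_token = ConsumerToken("example-key", "example-secret")  # hypothetical
handshaker = Handshaker("https://meta.wikimedia.org/w/index.php", consumer_token)

# Stand-in for the app's session storage. In the case above this lives in a
# database; if the session config still points at the old DB host, whatever
# is written here is effectively lost between the two steps below.
session_store = {}

def start_login(session_id):
    # Step 1: get a request token and the URL to send the user to.
    redirect_url, request_token = handshaker.initiate()
    session_store[session_id] = request_token  # must survive the redirect
    return redirect_url

def finish_login(session_id, response_query_string):
    # Step 2: the user returns from MediaWiki with oauth_verifier etc.
    request_token = session_store.get(session_id)
    if request_token is None:
        # The failure mode described above: the session (and with it the
        # request token) vanished, so the handshake cannot be completed.
        raise RuntimeError("OAuth request token not found for this session")
    access_token = handshaker.complete(request_token, response_query_string)
    return handshaker.identify(access_token)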
[15:44:44] (PS2) KartikMistry: Enable SectionTranslation in testwiki for af, as, gu, kn, mk and sr [mediawiki-config] - https://gerrit.wikimedia.org/r/787868 (https://phabricator.wikimedia.org/T304828)
[15:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:35:16] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:21:03] Cyberpower678: FWIW https://www.mediawiki.org/wiki/Help:OAuth/Errors can help give you a basic idea of where to look for a given error
[18:36:59] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:28:39] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:55] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:40:17] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:44:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:44:33] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:41:25] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:50:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale