[00:04:05] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:05:11] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[00:07:50] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS buster
[00:08:00] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster completed: - dns5004 (**PASS**)...
[00:09:07] !log rzl@cumin1001 conftool action : set/pooled=no; selector: name=mw14(39|40).eqiad.wmnet,cluster=videoscaler
[00:09:48] !log rzl@cumin1001 conftool action : set/pooled=no; selector: name=mw14(45|46).eqiad.wmnet,cluster=jobrunner
[00:13:43] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:19:45] RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[00:22:12] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:30:09] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:31:01] RECOVERY - Wikitech and wt-static content in sync on cloudweb1003 is OK: wikitech-static OK - wikitech and wikitech-static in sync (168447 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[00:31:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:32:45] (Traffic bill over quota) firing: (2) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:35:40] (PS1) Ssingh: Revert "lvs5004: commission new LVS host (eqsin hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/862916
[00:36:21] (PS1) Ssingh: Revert "hiera: temporarily remove references to dns5004" [puppet] - https://gerrit.wikimedia.org/r/862917
[00:36:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[00:37:44] (CR) CI reject: [V: -1] Revert "lvs5004: commission new LVS host (eqsin hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/862916 (owner: Ssingh)
[00:37:45] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:38:32] (PS2) Ssingh: Revert "lvs5004: commission new LVS host (eqsin hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/862916
[00:39:10] (CR) CI reject: [V: -1] Revert "lvs5004: commission new LVS host (eqsin hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/862916 (owner: Ssingh)
[00:41:31] (CR) Ssingh: "recheck" [puppet] - https://gerrit.wikimedia.org/r/862916 (owner: Ssingh)
[00:42:12] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:42:30] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (RLazarus)
[00:42:33] SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (RLazarus) Stalled→Open
[00:42:45] SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (RLazarus) Over to dcops! @Jclark-ctr Sorry to hand this off right as you're gone, but whenever you're back in the DC, these servers are offline and ready to be physically removed.
[00:44:24] (CR) Ssingh: "recheck" [puppet] - https://gerrit.wikimedia.org/r/862916 (owner: Ssingh)
[00:46:49] (CR) Ssingh: [C: +2] Revert "lvs5004: commission new LVS host (eqsin hardware refresh)" [puppet] - https://gerrit.wikimedia.org/r/862916 (owner: Ssingh)
[00:52:45] (Traffic bill over quota) firing: (3) Alert for device cr1-codfw.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[00:57:45] (Traffic bill over quota) resolved: Alert for device cr3-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[01:02:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting:
[01:02:41] tps://wikitech.wikimedia.org/wiki/Wikifeeds
[01:03:43] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[01:20:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:31:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:36:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:37:45] (JobUnavailable) firing: (2) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:57:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[02:12:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:36] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[02:37:38] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:22:40] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:26:34] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[03:52:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:54:06] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:17:11] (PS1) Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" [puppet] - https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816)
[04:28:28] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:29:51] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:32:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:38:17] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[04:39:29] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[05:00:23] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Hghani) Hi, The public key I submitted in the initial ticket matches the key that I have saved on my device. My config looks correct but I am still bein...
[05:41:10] (PS1) KartikMistry: Enable Section Translation on 8 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/863097 (https://phabricator.wikimedia.org/T319176)
[05:58:56] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (Marostegui) @Ottomata can you confirm if this also needs analytics-privatedata-users group membership without ssh and kerberos? We need @XenoRyet to approve as well.
[06:04:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[06:05:43] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (Marostegui) @jcrespo do you want/have a tracking task to productionize these hosts?
[06:06:26] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:06:35] (PS1) Marostegui: install_server: Do not reimage db120[45] [puppet] - https://gerrit.wikimedia.org/r/863098
[06:07:19] (CR) Marostegui: [C: +2] install_server: Do not reimage db120[45] [puppet] - https://gerrit.wikimedia.org/r/863098 (owner: Marostegui)
[06:13:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P42204 and previous config saved to /var/cache/conftool/dbconfig/20221202-061259-marostegui.json
[06:14:58] (PS1) Marostegui: db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/863099
[06:15:40] (CR) Marostegui: [C: +2] db1134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/863099 (owner: Marostegui)
[06:19:55] (PS1) Marostegui: mariadb: Add db1206 to s1 [puppet] - https://gerrit.wikimedia.org/r/863100 (https://phabricator.wikimedia.org/T324181)
[06:22:34] (CR) Marostegui: [C: +2] mariadb: Add db1206 to s1 [puppet] - https://gerrit.wikimedia.org/r/863100 (https://phabricator.wikimedia.org/T324181) (owner: Marostegui)
[06:30:15] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 185 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:34:19] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:40:59] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:42:21] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:43:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:45:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:45:33] (PS2) KartikMistry: testwiki: Enable Section Translation for 15 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/862412 (https://phabricator.wikimedia.org/T323825)
[06:57:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P42206 and previous config saved to /var/cache/conftool/dbconfig/20221202-065745-ladsgroup.json
[07:12:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P42207 and previous config saved to /var/cache/conftool/dbconfig/20221202-071250-ladsgroup.json
[07:27:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P42208 and previous config saved to /var/cache/conftool/dbconfig/20221202-072755-ladsgroup.json
[07:37:48] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) is CRITICAL: Test retrieve selected events on January 15 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:39:44] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:41:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:41:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:41:34] !log draining ganeti5001 for eventual decom T322048
[07:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:37] T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048
[07:43:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P42209 and previous config saved to /var/cache/conftool/dbconfig/20221202-074300-ladsgroup.json
[07:43:08] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:43:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:44:43] (PS1) Marostegui: Revert "db1134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/862918
[07:46:18] (CR) Marostegui: [C: +2] Revert "db1134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/862918 (owner: Marostegui)
[07:46:44] (CR) MVernon: [C: +1] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/863026 (owner: Eevans)
[07:48:12] (PS1) Marostegui: site.pp: db1132 is no longer special [puppet] - https://gerrit.wikimedia.org/r/863105
[07:49:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:49:04] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:49:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:49:35] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:49:41] (CR) Marostegui: [C: +2] site.pp: db1132 is no longer special [puppet] - https://gerrit.wikimedia.org/r/863105 (owner: Marostegui)
[07:49:44] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:49:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:52:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42210 and previous config saved to /var/cache/conftool/dbconfig/20221202-075601-root.json
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221202T0800)
[08:09:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:10:48] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:11:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42211 and previous config saved to /var/cache/conftool/dbconfig/20221202-081106-root.json
[08:12:08] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:16:46] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/863003 (owner: Volans)
[08:26:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42212 and previous config saved to /var/cache/conftool/dbconfig/20221202-082611-root.json
[08:29:50] (CR) David Caro: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: Andrew Bogott)
[08:31:10] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:32:27] (PS1) Zabe: Start writing to cul_actor everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/863228 (https://phabricator.wikimedia.org/T233004)
[08:32:55] (PS1) Aklapper: Redirect phabricator.wikimedia.org/r/ to gerrit.wikimedia.org/g/ [puppet] - https://gerrit.wikimedia.org/r/863229 (https://phabricator.wikimedia.org/T324311)
[08:33:02] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:36:22] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:39:11] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (jcrespo) This should do: T313582
[08:40:12] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[08:40:56] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:41:15] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:41:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42213 and previous config saved to /var/cache/conftool/dbconfig/20221202-084116-root.json
[08:41:24] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:41:40] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:44:44] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:48:40] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:56:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42214 and previous config saved to /var/cache/conftool/dbconfig/20221202-085621-root.json
[08:59:43] (CR) David Caro: "Will play with it more, just a quick review" [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[09:08:54] (CR) Elukey: [C: +2] kubeflow-kfserving: move to the Wikimedia storage-initializer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/711096 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[09:09:44] (CR) Elukey: [C: +2] "test" [deployment-charts] - https://gerrit.wikimedia.org/r/765242 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[09:09:52] (CR) Elukey: [C: +2] "test" [deployment-charts] - https://gerrit.wikimedia.org/r/765235 (https://phabricator.wikimedia.org/T301415) (owner: Elukey)
[09:11:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: After cloning db1206', diff saved to https://phabricator.wikimedia.org/P42215 and previous config saved to /var/cache/conftool/dbconfig/20221202-091126-root.json
[09:12:50] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad
[09:16:01] (CR) Vgutierrez: [C: +2] Release 0.36 [software/acme-chief] - https://gerrit.wikimedia.org/r/863028 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[09:17:48] Emperor: FYI swift objects count between the DCs ^^^
[09:19:25] volans: thanks, I don't think we care about that very much (not least we don't try and blindly replicate thumbs like we used to), but cc godog who set that alert up :)
[09:20:11] yeah I'd say safe to ignore thumbs at this point
[09:21:07] so should we just blindly remove the alert?
[09:21:18] I'll defer to Emperor for that :)
[09:22:48] have to run to a doc appt, bbiab
[09:22:51] * Emperor isn't sure it's ever been useful, so maybe +1 to removing it
[09:23:59] (PS1) Vgutierrez: setup.py: update dependencies for bullseye [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863231 (https://phabricator.wikimedia.org/T321309)
[09:24:01] (PS1) Vgutierrez: Release 0.36 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863232 (https://phabricator.wikimedia.org/T321309)
[09:24:03] (PS1) Vgutierrez: debian: Add release 0.36 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863233 (https://phabricator.wikimedia.org/T321309)
[09:25:11] (PS1) David Caro: harbor: fix interpolation in onlyif detecting compose [puppet] - https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314)
[09:27:49] (CR) CI reject: [V: -1] harbor: fix interpolation in onlyif detecting compose [puppet] - https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) (owner: David Caro)
[09:32:06] (PS2) David Caro: harbor: fix interpolation in onlyif detecting compose [puppet] - https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314)
[09:32:31] (CR) Alexandros Kosiaris: Update the spark and spark-operator images (4 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: Btullis)
[09:34:19] (PS1) Muehlenhoff: Switch to ganeti5004 in blackbox smoke tests [puppet] - https://gerrit.wikimedia.org/r/863236 (https://phabricator.wikimedia.org/T322048)
[09:35:29] (PS1) Muehlenhoff: Remove ganeti5001 from Puppet [puppet] - https://gerrit.wikimedia.org/r/863237 (https://phabricator.wikimedia.org/T322048)
[09:41:49] (CR) Vgutierrez: [C: +2] setup.py: update dependencies for bullseye [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863231 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[09:41:58] (CR) Vgutierrez: [C: +2] Release 0.36 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863232 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[09:44:33] (PS2) Volans: setup.py: update dependencies and metadata [software/spicerack] - https://gerrit.wikimedia.org/r/863003
[09:44:35] (PS2) Volans: spicerack: add module injection support [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401)
[09:46:24] (PS2) Vgutierrez: debian: Add release 0.36 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863233 (https://phabricator.wikimedia.org/T321309)
[09:49:50] (CR) Volans: "replies inline" [software/spicerack] - https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[09:51:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[09:52:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2013.codfw.wmnet to cluster codfw and group C
[09:53:04] !log rebalance ganeti codfw/C T323222
[09:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:07] T323222: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222
[09:54:15] SRE, ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (MoritzMuehlenhoff) Open→Resolved The RAID rebuild has completed and the server has been readded to the cluster.
[09:54:56] !log installing debootstrap updates from bullseye point release
[09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:14] (CR) Vgutierrez: [C: +2] debian: Add release 0.36 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/863233 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez)
[09:56:54] SRE, Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (MoritzMuehlenhoff)
[09:58:04] !log installing publicsuffix updates from bullseye/buster point releases
[09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:42] (Abandoned) Hashar: Plugin to customize Zuul reports [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/858598 (owner: Hashar)
[10:01:46] !log upload acme-chief 0.36 to apt.wm.o (bullseye) - T321309
[10:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:49] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309
[10:03:08] SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[10:04:42] (CR) David Caro: spicerack: add module injection support
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:04:46] (03CR) 10CI reject: [V: 04-1] setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [10:04:50] (03CR) 10CI reject: [V: 04-1] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:06:05] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38555/console" [puppet] - 10https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) (owner: 10David Caro) [10:13:15] (03CR) 10JMeybohm: Add a new production image for otelcol (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [10:15:09] (03PS9) 10Clément Goubert: Add a new production image for otelcol [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) [10:15:26] (03CR) 10Clément Goubert: Add a new production image for otelcol (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [10:16:30] (03CR) 10JMeybohm: Add a new production image for otelcol (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [10:17:55] (03PS3) 10David Caro: harbor: fix interpolation in onlyif detecting compose [puppet] - 10https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) [10:17:57] (03PS1) 10David Caro: harbor: add hiera default values [puppet] - 10https://gerrit.wikimedia.org/r/863242 
(https://phabricator.wikimedia.org/T324314) [10:19:07] (03PS1) 10Jaime Nuche: mwdebug_deploy: remove deployment timer [puppet] - 10https://gerrit.wikimedia.org/r/863243 [10:20:50] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [10:21:20] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [10:22:06] (03CR) 10Clément Goubert: [C: 03+1] mwdebug_deploy: remove deployment timer [puppet] - 10https://gerrit.wikimedia.org/r/863243 (owner: 10Jaime Nuche) [10:26:51] (03CR) 10Muehlenhoff: [C: 03+2] Switch to ganeti5004 in blackbox smoke tests [puppet] - 10https://gerrit.wikimedia.org/r/863236 (https://phabricator.wikimedia.org/T322048) (owner: 10Muehlenhoff) [10:28:01] (03CR) 10Volans: "reply inline, no new PS yet" [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:31:03] (03PS1) 10Elukey: kserve-inference: add a way to render "resources" in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/863247 (https://phabricator.wikimedia.org/T323624) [10:34:22] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for decom [10:34:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti5001.eqsin.wmnet with reason: Remove from cluster for decom [10:35:44] (03PS2) 10Elukey: kserve-inference: add a way to render "resources" in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/863247 (https://phabricator.wikimedia.org/T323624) [10:39:20] (03PS4) 10David Caro: harbor: fix interpolation in onlyif detecting compose [puppet] - 10https://gerrit.wikimedia.org/r/863234 
(https://phabricator.wikimedia.org/T324314) [10:40:05] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38558/console" [puppet] - 10https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) (owner: 10David Caro) [10:41:04] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/863247 (https://phabricator.wikimedia.org/T323624) (owner: 10Elukey) [10:44:03] (03Abandoned) 10David Caro: harbor: add hiera default values [puppet] - 10https://gerrit.wikimedia.org/r/863242 (https://phabricator.wikimedia.org/T324314) (owner: 10David Caro) [10:44:16] (03CR) 10FNegri: [C: 03+1] quota_increase: Fix issue with dashed quota names (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [10:44:52] (03CR) 10Muehlenhoff: [C: 03+2] Remove ganeti5001 from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/863237 (https://phabricator.wikimedia.org/T322048) (owner: 10Muehlenhoff) [10:46:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug_deploy: remove deployment timer [puppet] - 10https://gerrit.wikimedia.org/r/863243 (owner: 10Jaime Nuche) [10:51:04] (03CR) 10Elukey: [C: 03+2] kserve-inference: add a way to render "resources" in the right spot [deployment-charts] - 10https://gerrit.wikimedia.org/r/863247 (https://phabricator.wikimedia.org/T323624) (owner: 10Elukey) [10:52:08] PROBLEM - Check for large files in client bucket on deploy1002 is CRITICAL: WARNING: large files in client bucket https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [10:56:19] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:57:52] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti5001.eqsin.wmnet [11:00:00] (03CR) 10Volans: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [11:02:06] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [11:02:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38559/console" [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:02:54] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:03:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38560/console" [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:04:59] 10SRE-tools, 10Infrastructure-Foundations, 10homer: Add CI to homer-deploy repo - https://phabricator.wikimedia.org/T277440 (10ayounsi) [11:05:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38561/console" [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:05:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/860572 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:09:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38562/console" [puppet] - 10https://gerrit.wikimedia.org/r/860573 
(https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:09:49] (03CR) 10Awight: Fixup development tooling for wider compatibility (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/845680 (owner: 10Stef Dunlap) [11:10:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:11:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] harbor: fix interpolation in onlyif detecting compose [puppet] - 10https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) (owner: 10David Caro) [11:11:52] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [11:12:20] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:13:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38563/console" [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:13:17] Emperor: yeah it has been useful in the past to catch swiftrepl not working, the easiest is probably excluding thumbs from the check [11:14:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please let me know if you need assistance with reprepro." 
[puppet] - 10https://gerrit.wikimedia.org/r/862994 (owner: 10Vivian Rook) [11:15:54] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [11:16:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [11:16:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti5001.eqsin.wmnet [11:19:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:19:55] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [11:20:12] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10MoritzMuehlenhoff) ganeti5001 has been decommissioned and can be unracked. [11:23:13] (03PS1) 10Muehlenhoff: Make puppetdb[12]003 puppetdb nodes [puppet] - 10https://gerrit.wikimedia.org/r/863255 [11:28:04] 10SRE, 10Traffic, 10serviceops, 10Platform Team Initiatives (API Gateway): Handle edge cache invalidation for the api gateway - https://phabricator.wikimedia.org/T324200 (10Joe) >>! In T324200#8436279, @daniel wrote: > Note that we only need active purging if/when we emit cache control headers that tell th... 
[11:33:10] (03PS1) 10Ilias Sarantopoulos: ml-services: fix env var type for asyncio workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/863257 (https://phabricator.wikimedia.org/T323624) [11:42:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:45:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:47:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:25] (03CR) 10ArielGlenn: "Because this change means that 8 or so jobs will be running at once, in addition to the jobs that already run on the 'misc dumps' snapshot" [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [11:47:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [11:51:30] (03PS7) 10Awight: [WIP] kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 
(https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [11:51:48] (03CR) 10Awight: "PS 7: trivial rebase" [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [11:52:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:22] (03PS3) 10Volans: setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 [11:52:24] (03PS3) 10Volans: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) [11:57:36] (03PS4) 10Volans: setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 [11:57:38] (03PS4) 10Volans: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) [12:00:45] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [12:02:28] (03CR) 10CI reject: [V: 04-1] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:03:14] (03CR) 10Hokwelum: [C: 03+1] "This looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/856655 (owner: 10Ebernhardson) [12:06:35] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:39] RECOVERY - Check systemd state on an-presto1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:49] !log dropping all databases from db1133 [12:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:00] (03PS1) 10Jcrespo: miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) [12:15:14] (03PS2) 10Jcrespo: miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) [12:15:49] (03CR) 10CI reject: [V: 04-1] miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) (owner: 10Jcrespo) [12:17:32] (03CR) 10Jbond: [C: 03+1] "lgtm minus the tox error" [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:17:55] (03CR) 10Slyngshede: [C: 03+2] ldap:client:utils remove outdated ldaplist util. 
[puppet] - 10https://gerrit.wikimedia.org/r/862833 (https://phabricator.wikimedia.org/T114063) (owner: 10Slyngshede) [12:18:14] (03CR) 10Jbond: [C: 03+1] setup.py: update dependencies and metadata [software/spicerack] - 10https://gerrit.wikimedia.org/r/863003 (owner: 10Volans) [12:19:42] (03PS20) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [12:22:27] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [12:27:32] (03PS3) 10Jcrespo: miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) [12:32:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [12:32:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863255 (owner: 10Muehlenhoff) [12:36:43] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [12:37:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:18] (03PS1) 10Muehlenhoff: postgresql::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/863286 (https://phabricator.wikimedia.org/T321783) [12:38:41] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:38:49] (03PS6) 10Jbond: 
cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [12:40:05] (03PS5) 10Volans: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) [12:40:54] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [12:50:24] (03CR) 10Jbond: [C: 03+1] ldap:management rewrite modify-mfa to use Bitu. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [12:57:43] (03PS2) 10Jbond: apereo_cas: add OidcRegisteredService service support [puppet] - 10https://gerrit.wikimedia.org/r/863006 [13:02:07] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:35] (03PS2) 10Muehlenhoff: postgresql::server: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/863286 (https://phabricator.wikimedia.org/T321783) [13:04:39] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [13:06:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/863286 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [13:08:37] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:08:43] 10SRE, 10serviceops-collab, 10serviceops-radar: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - 
https://phabricator.wikimedia.org/T119679 (10Aklapper) [13:10:05] (03PS7) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:10:53] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [13:11:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10jnuche) `mwdeploy` has uid/gid 499 in prod hosts and 603 in beta and WMCS. Those literal values are not specified anywhere I could find (unlike reserved uid... [13:15:25] (03PS3) 10Jbond: apereo_cas: add OidcRegisteredService service support [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) [13:15:27] (03PS1) 10Jbond: apereo_cas::services: drop mfa-u2f support [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) [13:23:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10taavi) In WMCS, the `mwdeploy` user and group and their UID numbers are defined directly in LDAP. I assume this is for historical reasons (i.e. from the ver... 
[13:37:27] (03PS1) 10Muehlenhoff: durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) [13:37:29] (03PS1) 10Muehlenhoff: wikidough: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) [13:37:31] (03PS1) 10Muehlenhoff: oozie: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863296 (https://phabricator.wikimedia.org/T308013) [13:37:33] (03PS1) 10Muehlenhoff: hive: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863297 (https://phabricator.wikimedia.org/T308013) [13:37:35] (03PS1) 10Muehlenhoff: lvs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863298 (https://phabricator.wikimedia.org/T308013) [13:37:37] (03PS1) 10Muehlenhoff: zuul: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863299 (https://phabricator.wikimedia.org/T308013) [13:37:39] (03PS1) 10Muehlenhoff: presto: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013) [13:37:41] (03PS1) 10Muehlenhoff: cache::kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863301 (https://phabricator.wikimedia.org/T308013) [13:37:43] (03PS1) 10Muehlenhoff: quarry: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863302 (https://phabricator.wikimedia.org/T308013) [13:37:45] (03PS1) 10Muehlenhoff: envoy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013) [13:37:47] (03PS1) 10Muehlenhoff: calico / dragonfly: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863304 (https://phabricator.wikimedia.org/T308013) [13:37:49] (03PS1) 10Muehlenhoff: Add SPDX headers to various base/IF profiles [puppet] - 10https://gerrit.wikimedia.org/r/863305 (https://phabricator.wikimedia.org/T308013) [13:40:11] (03PS8) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 
10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:41:51] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [13:43:48] (03PS1) 10Stevemunene: Add an-presto1007 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/863327 (https://phabricator.wikimedia.org/T323783) [13:46:13] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/863327 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [13:47:44] (03PS9) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:53:41] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38565/console" [puppet] - 10https://gerrit.wikimedia.org/r/863327 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [13:54:57] (03CR) 10Btullis: [C: 03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/863300 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:57:33] (03PS2) 10Muehlenhoff: durum: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/863294 (https://phabricator.wikimedia.org/T308013) [13:57:42] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Add an-presto1007 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/863327 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [14:02:42] (03PS1) 10Muehlenhoff: Set role_contacts for apifeatureusage::logstash [puppet] - 10https://gerrit.wikimedia.org/r/863329 [14:06:03] (03PS10) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [14:06:09] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [14:10:29] (03CR) 10Elukey: [C: 03+2] ml-services: fix env var type for asyncio workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/863257 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [14:11:06] (03CR) 10Filippo Giunchedi: "AFAIK this role specifically is owned by search platform, i.e. 
it was split to separate VMs from logstash for ownership purposes" [puppet] - 10https://gerrit.wikimedia.org/r/863329 (owner: 10Muehlenhoff) [14:11:43] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [14:12:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [14:14:25] 10SRE, 10Legalpad: Explicitly mention npm in L3 - https://phabricator.wikimedia.org/T213971 (10LSobanski) a:03LSobanski [14:14:28] 10SRE, 10SRE Observability: rsyslog service should fail on configuration errors - https://phabricator.wikimedia.org/T290870 (10fgiunchedi) [14:16:09] (03PS1) 10Jbond: Gemfile: add sorted set for ruby 3.0 and above [puppet] - 10https://gerrit.wikimedia.org/r/863331 [14:19:21] (03CR) 10David Caro: [V: 03+1 C: 03+2] harbor: fix interpolation in onlyif detecting compose [puppet] - 10https://gerrit.wikimedia.org/r/863234 (https://phabricator.wikimedia.org/T324314) (owner: 10David Caro) [14:21:26] (03PS1) 10Ssingh: Revert "Revert "lvs5004: commission new LVS host (eqsin hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/863349 [14:22:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [14:22:12] (03PS1) 10Ssingh: hiera: enable haproxy systemd hardening on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) [14:22:17] (03CR) 10David Caro: "recheck" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [14:22:50] (03PS2) 10Ssingh: hiera: enable haproxy systemd 
hardening on cp4045 [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) [14:23:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38566/console" [puppet] - 10https://gerrit.wikimedia.org/r/863332 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [14:24:51] (03CR) 10Jbond: [C: 03+2] Gemfile: add sorted set for ruby 3.0 and above [puppet] - 10https://gerrit.wikimedia.org/r/863331 (owner: 10Jbond) [14:25:05] 10SRE, 10Developer Productivity: Apache error log noise "Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1" on mwdebug1001 - https://phabricator.wikimedia.org/T236401 (10LSobanski) 05Open→03Resolved a:03LSobanski Couldn't find the error in the recent logs. Resolving, please reopen if thi... [14:25:24] (03PS11) 10Jbond: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [14:25:37] PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:28:14] (03PS1) 10Ssingh: P:dns::auth::update: increase git clone timeout [puppet] - 10https://gerrit.wikimedia.org/r/863333 (https://phabricator.wikimedia.org/T324334) [14:28:25] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Damilare) Yes that's correct. Pardon my ignorance, but does this also mean I already have Turnilo access? 
[14:28:40] (03PS2) 10Jbond: apereo_cas::services: drop mfa-u2f support [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) [14:29:17] (03CR) 10Andrea Denisse: [C: 03+2] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse) [14:29:20] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) @Damilare T319057 was closed, so I would assume so. Can you please test and let us know if you get any errors? [14:29:24] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38567/console" [puppet] - 10https://gerrit.wikimedia.org/r/863333 (https://phabricator.wikimedia.org/T324334) (owner: 10Ssingh) [14:29:39] (03PS3) 10Jbond: apereo_cas::services: drop mfa-u2f support [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) [14:30:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38568/console" [puppet] - 10https://gerrit.wikimedia.org/r/863292 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [14:31:09] (03CR) 10BBlack: [C: 03+1] P:dns::auth::update: increase git clone timeout [puppet] - 10https://gerrit.wikimedia.org/r/863333 (https://phabricator.wikimedia.org/T324334) (owner: 10Ssingh) [14:32:34] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10BBlack) Comparison point: operations/dns is quite a bit different: total byte size is ~1/4 the size (~6MB vs the ~22MB size of netbox-exports), but has ~... 
[14:33:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Marostegui) 05In progress→03Resolved Added to `nda` group and created the kerberos principal. @dasm you should've received an email with further instructions. Also please a... [14:33:29] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Damilare) Thanks @Marostegui, looks like I do have the access. I was just able to sign in with my WikiTech LDAP. [14:33:35] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: increase git clone timeout [puppet] - 10https://gerrit.wikimedia.org/r/863333 (https://phabricator.wikimedia.org/T324334) (owner: 10Ssingh) [14:34:16] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) 05Open→03Resolved a:03Marostegui Excellent, closing this then! 
[14:36:12] (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily remove references to dns5004" [puppet] - 10https://gerrit.wikimedia.org/r/862917 (owner: 10Ssingh) [14:36:30] (03PS2) 10Ssingh: Revert "hiera: temporarily remove references to dns5004" [puppet] - 10https://gerrit.wikimedia.org/r/862917 [14:36:37] (03CR) 10Vgutierrez: [C: 03+1] Revert "Revert "lvs5004: commission new LVS host (eqsin hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/863349 (owner: 10Ssingh) [14:38:16] (03PS1) 10Ilias Sarantopoulos: kserve-inference: change chart to enclose all env vars in quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 [14:38:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS buster [14:38:36] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster [14:41:24] (03CR) 10Elukey: kserve-inference: change chart to enclose all env vars in quotes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [14:43:39] (03PS2) 10Ilias Sarantopoulos: kserve-inference: change chart to enclose all env vars in quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 [14:43:50] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/863335 [14:44:30] (03CR) 10Ilias Sarantopoulos: kserve-inference: change chart to enclose all env vars in quotes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [14:44:44] (03CR) 10Ssingh: [C: 03+2] Revert "Revert "lvs5004: commission new LVS host (eqsin hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/863349 (owner: 10Ssingh) [14:44:52] (03PS2) 10Ssingh: Revert "Revert "lvs5004: commission new LVS 
host (eqsin hardware refresh)"" [puppet] - 10https://gerrit.wikimedia.org/r/863349 [14:45:14] (03CR) 10Elukey: kserve-inference: change chart to enclose all env vars in quotes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [14:45:45] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:56] PROBLEM - Host 2001:df2:e500:1:103:102:166:8 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:59] (03CR) 10Muehlenhoff: Set role_contacts for apifeatureusage::logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863329 (owner: 10Muehlenhoff) [14:47:42] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:47:46] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Open LibreNMS port for netmon2002. [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [14:48:52] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host lvs5004.eqsin.wmnet with OS buster [14:49:02] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster [14:49:04] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:04] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/863335 (owner: 10Muehlenhoff) [14:50:34] sukhe: dns5004 it's you? 
[14:50:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:05] I see the reimage started earlier so assuming yes [14:51:12] volans: yep [14:51:13] (03CR) 10CI reject: [V: 04-1] kserve-inference: change chart to enclose all env vars in quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [14:51:46] for those if needed you do an additional run of the downtime cookbook to downtime the IP-based icinga "hosts" [14:51:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:51:55] * you can do [14:52:14] ah [14:52:16] (03PS1) 10Muehlenhoff: Add role_contacts for role::analytics_test_cluster::presto::server [puppet] - 10https://gerrit.wikimedia.org/r/863340 [14:52:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [14:52:30] I'm taking a look at the thanos query alerts [14:53:36] on that note, please ignore the recursive DNS alerts for a bit. 
the reimaging of dns5004 should be resolved soon, hopefully!™ [14:53:54] 10SRE, 10Traffic, 10Patch-For-Review: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 (10BBlack) [14:54:06] 10SRE, 10Traffic, 10Patch-For-Review: Replace edge cache conftool entries 'varnish-fe' and 'ats-tls' with singular 'cdn' - https://phabricator.wikimedia.org/T324336 (10BBlack) [14:54:21] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/863295/38569/" [puppet] - 10https://gerrit.wikimedia.org/r/863295 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:54:34] (03CR) 10Btullis: [C: 03+1] Add role_contacts for role::analytics_test_cluster::presto::server [puppet] - 10https://gerrit.wikimedia.org/r/863340 (owner: 10Muehlenhoff) [14:56:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:57:14] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [14:57:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [14:58:44] those were some heavy queries for big metrics say spanning 90d btw [14:58:45] (03CR) 10Muehlenhoff: [C: 03+2] Add role_contacts for role::analytics_test_cluster::presto::server [puppet] - 10https://gerrit.wikimedia.org/r/863340 (owner: 10Muehlenhoff) [14:59:02] not sure yet what was the offending query/queries [14:59:05] (03CR) 10Herron: wdqs: add grizzly dashboard for uptime (034 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [15:00:22] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Ottomata) It might be hard for us to help you, since I'm not aware of many folks that use Windows. Without more context, that error message looks like the... [15:03:00] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10Ottomata) @AnnWF @XenoRyet can you please fill in the "Reason for access" part of the task description? That will help us figure out what access you need. If the same reason as T324058, the... [15:03:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10Volans) Git a simple `git gc` we go from 22 MB to 4MB for a clone and 2.4MB for a bare clone. I'll run `git gc` everywhere right now. 
[15:06:19] !log run `git gc` on /srv/netbox-exports/dns.git on netbox[12]002 - T324334 [15:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:22] T324334: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 [15:09:42] (03CR) 10Elukey: [C: 03+2] kserve-inference: change chart to enclose all env vars in quotes [deployment-charts] - 10https://gerrit.wikimedia.org/r/863334 (owner: 10Ilias Sarantopoulos) [15:10:01] (03PS6) 10FNegri: cumin::target: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [15:10:42] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10Volans) On my home on `dns5002`, not triggered by Puppet: ` dns5002 0 15:09:41 ~ $ time git clone 'https://netbox-exports.wikimedia.org/dns.git' dns-te... [15:10:45] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:56] (03PS7) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [15:12:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [15:12:52] (03CR) 10CI reject: [V: 04-1] cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [15:13:52] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[15:15:16] volans: another issue that you may be able to help with :) [15:15:18] [7/50, retrying in 21.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title lvs5004 not found yet [15:15:46] yesterday this timed out for me [15:16:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [15:17:28] sukhe: so, that's polling puppetdb for the exported resources after a NOOP puppet run [15:17:49] right, so yesterday I ran it with --new, since it was a new host [15:18:00] (03CR) 10Btullis: "This change is ready for review." (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:18:38] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:18:50] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:18:56] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:18:57] ruh ruh [15:19:02] or ruh roh [15:19:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:19:33] (03CR) 10Andrew Bogott: Revert "oslo_messaging_rabbit: increase retry and backoff by a lot" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863090 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [15:19:52] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:19:58] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:20:06] ^^ I'm responding to WDQS alerts, will ACK shortly [15:20:14] ack [15:20:22] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:20:24] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs10 [15:20:24] .wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:20:58] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 7.508 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:21:01] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:21:14] page acked [15:21:26] inflatador: anything we can do to help? [15:21:34] <_joe_> rolling restart? 
[15:21:48] <_joe_> ah inflatador is on it already <3 [15:21:56] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.521 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:21:56] <_joe_> we're here to help in case :) [15:22:00] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [15:22:44] Thanks _joe_ and volans . Rolling restart is probably it...guessing it is a repeat of https://phabricator.wikimedia.org/T323620 [15:22:49] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [15:23:02] <_joe_> yeah I was guessing something similar indeed [15:23:28] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:24:02] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 3.378 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:25:46] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:01] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:26:07] wow, that was quick [15:26:22] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.134 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:26:37] confirmed event resolved on VO [15:28:46] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[15:29:37] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:30:14] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [15:30:30] has anyone become iC? [15:30:30] inflatador: thanks for the super fast response! [15:31:16] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:32:46] <_joe_> brett: no need really, it's a known problem of a flawed service and inflatador was super fast in resolving it [15:33:04] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:33:10] _joe_: Is that to say that an incident doc isn't needed? [15:33:33] <_joe_> I don't think it's worth it, but I'd let gehel and inflatador be the judges [15:33:48] <_joe_> maybe it's useful for them to keep track of their work on wdqs :) [15:33:58] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[15:34:00] hmm, new alert after restart [15:34:12] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:34:21] <_joe_> ok, then I take it back, maybe we need an IC :) [15:34:46] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:34:58] <_joe_> yeah load averages are through the roof [15:35:10] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs10 [15:35:10] .wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:35:34] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:35:41] I'll be IC (be advised, it's my first time) [15:35:42] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:35:54] to what Host headers do WDQS respond? 
[15:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:36:01] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:36:15] * volans acked [15:36:16] brett thanks [15:36:18] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:36:20] * brett acked [15:36:24] I'm grabbing the stack traces now [15:36:26] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [15:36:30] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.554 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:36:32] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:37:21] <_joe_> inflatador, gehel if we suspect this is external traffic, can't we limit the servers which serve external direct traffic? 
[15:37:29] https://docs.google.com/document/d/1yLpK5yB9moJi1srCdgoEOBUn-9sUe1-ONJ4OGUgV4pE/edit?usp=sharing is the status doc [15:38:50] _joe_ I haven't done it personally but I think next steps would probably be https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Identifying_the_user_agent [15:39:04] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.078 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:39:06] _joe_: what do you mean? The public traffic is already routed to a different cluster [15:39:16] gehel: what's the public HOSTNAME for this traffic? [15:39:21] (assuming it's a bad query, we don't even know that yet) [15:39:24] query.wikidata.org [15:40:00] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [15:40:00] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 7.380 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:40:06] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[15:40:15] ^^ some of these recovered before I did a rolling restart [15:41:01] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:41:08] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.994 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:42:02] (03PS1) 10Ssingh: lvs5004: update interface names in profile::lvs::interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/863367 (https://phabricator.wikimedia.org/T322048) [15:42:04] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.207 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:42:08] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.250 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:42:24] volans: all good on dns5004. 
I will get you the exact details later on the time but yes :) [15:42:38] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:42:41] watching load-15 ( https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=12 ) , typically if it's > 40, that means user impact [15:43:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:43:24] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.255 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:43:27] here if hands needed for menial work or graph watching [15:43:29] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS buster [15:43:38] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5004.wikimedia.org with OS buster completed: - dns5004 (**PASS**)... [15:43:45] sukhe: ack thanks! 
[15:44:04] (03CR) 10Ssingh: [C: 03+2] lvs5004: update interface names in profile::lvs::interface_tweaks [puppet] - 10https://gerrit.wikimedia.org/r/863367 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:44:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:45:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:46:15] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns5004 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/862998 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [15:47:42] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [15:48:00] 10SRE, 10Infrastructure-Foundations, 10Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10BBlack) 05Open→03Resolved a:03Volans Confirmed the same. That was a simple and elegant fix, so I doubt there's any reason to pursue more-complex options! Thank you! [15:48:01] has the cause of the lag/CPU usage been established? 
[15:48:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:48:30] brett convo is happening in #mediawiki_security [15:48:45] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 862998" [15:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:04] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs10 [15:49:04] .wmnet, wdqs1016.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:49:09] Chan is +i, I need an invite [15:49:13] sukhe: Can you invite me? 
[15:50:02] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:50:30] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs10 [15:50:30] .wmnet, wdqs1005.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:50:58] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:51:01] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:51:06] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:51:22] acked [15:51:30] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:51:48] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:52:10] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:53:08] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:53:48] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.743 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:54:19] (03PS23) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [15:55:02] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [15:55:04] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.376 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:55:14] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [15:56:02] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 4.110 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:57:34] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.860 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:58:32] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:58:54] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.106 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:59:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are 
healthy https://wikitech.wikimedia.org/wiki/PyBal [15:59:13] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [16:00:32] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.669 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:01:01] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:02:50] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5004.eqsin.wmnet with reason: host reimage [16:02:55] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1012.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: w [16:02:55] Servers wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:03:40] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:04:57] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:08:03] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but 
pooled: wdqs-ssl_443: Servers wdqs1013.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs10 [16:08:03] .wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:08:18] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:43] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:08:50] (03PS1) 10Ssingh: hiera: update lvs5004 primary interface [puppet] - 10https://gerrit.wikimedia.org/r/863371 [16:09:18] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:27] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:10:39] (03CR) 10Ssingh: [C: 03+2] hiera: update lvs5004 primary interface [puppet] - 10https://gerrit.wikimedia.org/r/863371 (owner: 10Ssingh) [16:11:01] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:07] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:11:25] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:12:35] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:20:05] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:21:36] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:23:23] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 1.662 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:24:17] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.633 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:25:43] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.422 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:26:01] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:39] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 1.304 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:27:23] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are 
marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs10 [16:27:23] .wmnet, wdqs1005.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1004.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:28:18] !log sukhe@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs5004.eqsin.wmnet with OS buster [16:28:29] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster completed: - lvs5004 (**WARN**)... [16:28:39] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host lvs5004.eqsin.wmnet with OS buster executed with errors: - lvs5004 (... [16:29:06] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [16:30:11] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 757 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:31:01] (ProbeDown) firing: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:07] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:31:47] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:33:15] PROBLEM - WDQS SPARQL on wdqs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:39:14] (03PS1) 10BBlack: cache_misc VCL: include requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/863375 [16:39:53] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:40:52] (03CR) 10BBlack: [C: 03+2] cache_misc VCL: include requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/863375 (owner: 10BBlack) [16:41:31] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:44:06] !log running agent on A:cp-text for https://gerrit.wikimedia.org/r/c/operations/puppet/+/863375 (requestctl for misc) [16:44:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:50] !log jnuche@deploy1002 Started scap: testing k8s deployment [16:45:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:46:39] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update the spark and spark-operator images (035 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [16:47:05] (03CR) 10Filippo Giunchedi: Set role_contacts for apifeatureusage::logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/863329 (owner: 10Muehlenhoff) [16:47:53] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [16:49:03] !log (above agent runs completed on all text nodes for requestctl-for-misc patch) [16:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:29] !log bking@cumin2002 START - Cookbook sre.wdqs.restart [16:51:07] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:51:39] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.060 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:51:55] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 6.485 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:52:03] RECOVERY - WDQS SPARQL 
on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:53:15] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:53:25] !log jnuche@deploy1002 Finished scap: testing k8s deployment (duration: 08m 35s) [16:53:31] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.118 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:54:19] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:54:23] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.117 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:55:03] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:55:05] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:55:19] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.076 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:56:01] (ProbeDown) resolved: Service wdqs-ssl:443 has failed probes (http_wdqs-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#wdqs-ssl:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:42] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) 
[17:03:27] PROBLEM - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [17:17:03] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [17:34:35] (CR) Ssingh: [C: +2] sites.yaml: add lvs5004 (eqsin hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/862944 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh) [17:36:05] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 862944" [17:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:12] SRE, Gerrit, serviceops-collab, Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (LSobanski) [17:37:21] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:37:29] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:39:18] hmm, unexpected [17:39:19] 10.132.0.39 64600 0 0 0 0 2:09 Active [17:40:01] SRE, MediaWiki-extensions-Score, TestMe: Contrabass MIDI instrument is unusable - https://phabricator.wikimedia.org/T199356 (LSobanski) Open→Resolved a:LSobanski Resolving based on the most recent update. Please reopen if this is still a problem. 
[17:49:53] SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (Volans) Manually powered off `mw1312`: ` cumin1001 0 17:46:27 ~ $ sudo ipmitool -I lanplus -H "10.65.1.83" -U root -E chassis power off Unable to read password from environment Password: Ch... [17:57:42] (PS1) BBlack: profile::pybal: expand the lvs hostname regexen. [puppet] - https://gerrit.wikimedia.org/r/863379 (https://phabricator.wikimedia.org/T322048) [18:00:08] !log performed git gc on all (auth)dns hosts in /srv/git/netbox_dns_snippets - T324334 [18:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:12] T324334: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 [18:00:18] (CR) Ssingh: [C: +1] "And thanks!" [puppet] - https://gerrit.wikimedia.org/r/863379 (https://phabricator.wikimedia.org/T322048) (owner: BBlack) [18:01:37] !log volans@cumin1001 START - Cookbook sre.dns.netbox [18:01:49] (CR) Ssingh: [C: +2] profile::pybal: expand the lvs hostname regexen. 
[puppet] - https://gerrit.wikimedia.org/r/863379 (https://phabricator.wikimedia.org/T322048) (owner: BBlack) [18:03:40] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Test run after git gc - volans@cumin1001" [18:05:31] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Test run after git gc - volans@cumin1001" [18:05:31] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:46] SRE, Infrastructure-Foundations, Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (Volans) And run `sudo cookbook -d sre.dns.netbox --force cbf1f50e2897f654f6f4f2ce639e7b9cf85e54cd -t T324334 "Test run after gc"` to ensure it all still works fine [18:13:13] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/863380 [18:13:54] (CR) RLazarus: [C: +1] envoy: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/863303 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [18:14:12] !log cr[23]-eqsin*: set routing-options static route 103.102.166.224/28 next-hop 10.132.0.39 [18:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:23] SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (RLazarus) Thanks @Volans. DC ops, for mw1320, I wasn't able to manually shut it off -- please do just kill the power when you go in to unrack it. Thanks! [18:16:43] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (Hghani) Hi, PS C:\WINDOWS\system32> ssh -v stat1005.eqiad.wmnet -L 8880:127.0.0.1:8880 OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2 debug1: Reading configur... 
[18:18:18] (PS1) Volans: sre.hosts.reimage: call the Hiera cookbook [cookbooks] - https://gerrit.wikimedia.org/r/863381 [18:19:16] (CR) Ssingh: [V: +1 C: +2] lvs5001: set profile::pybal::bgp to no [puppet] - https://gerrit.wikimedia.org/r/862946 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [18:20:21] !log decomm lvs5001: restarting pybal [18:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs5001.eqsin.wmnet with reason: downtimed, in the process of decom [18:22:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs5001.eqsin.wmnet with reason: downtimed, in the process of decom [18:22:53] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:57] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:23:22] ^ it's expected [18:28:32] (PS1) Ssingh: lvs5004: set as high-traffic1 primary LVS and remove lvs4006 (decomm) [puppet] - https://gerrit.wikimedia.org/r/863382 (https://phabricator.wikimedia.org/T323830) [18:30:58] (PS2) Ssingh: lvs5004: set as high-traffic1 primary LVS and remove lvs5001 (decomm) [puppet] - https://gerrit.wikimedia.org/r/863382 (https://phabricator.wikimedia.org/T323830) [18:35:21] (PS3) Ssingh: lvs5004: set as high-traffic1 primary LVS and remove lvs5001 (decomm) [puppet] - https://gerrit.wikimedia.org/r/863382 (https://phabricator.wikimedia.org/T323830) [18:35:41] (PS1) Ssingh: sites.yaml: remove decommissioned host lvs5001 [homer/public] - https://gerrit.wikimedia.org/r/863383 (https://phabricator.wikimedia.org/T323830) [18:36:12] (CR) BBlack: [C: +1] sites.yaml: remove decommissioned host lvs5001 
[homer/public] - https://gerrit.wikimedia.org/r/863383 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [18:36:26] (CR) BBlack: [C: +1] lvs5004: set as high-traffic1 primary LVS and remove lvs5001 (decomm) [puppet] - https://gerrit.wikimedia.org/r/863382 (https://phabricator.wikimedia.org/T323830) (owner: Ssingh) [18:44:31] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs5001.eqsin.wmnet [18:45:52] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (ssingh) [18:49:03] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [18:51:40] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:52:19] SRE, Infrastructure-Foundations, Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (ssingh) While running the decommissioning cookbook for lvs5001, I ran into this: ` (13) authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4003-4004,5002,6001... 
[18:53:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5001.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [18:53:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:53:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs5001.eqsin.wmnet [18:53:21] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5001.eqsin.wmnet` - lvs5001.eqsin.wmnet... [18:54:36] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs5001 [homer/public] - 10https://gerrit.wikimedia.org/r/863383 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [18:55:59] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 863383" [18:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:02] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [19:07:38] !log gitlab-runner* - upgrading gitlab-runner package version [19:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:32] (03CR) 10Ssingh: [C: 03+2] lvs5004: set as high-traffic1 primary LVS and remove lvs5001 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/863382 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [19:11:53] !log restart pybal on lvs5004 [19:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:42] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 
2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:27:00] (03CR) 10Herron: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [19:28:38] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:36:54] !log fixed git checkout permissions T324334 [19:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:58] T324334: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 [19:37:29] !log volans@cumin1001 START - Cookbook sre.dns.netbox [19:37:40] sukhe: FYI &&& [19:37:44] ^^^ [19:38:41] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:39:51] !log volans@cumin1001 START - Cookbook sre.dns.netbox [19:41:39] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force run after a permission problem - volans@cumin1001" [19:42:46] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Force run after a permission problem - volans@cumin1001" [19:42:46] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:43:27] 10SRE, 10Infrastructure-Foundations, 10Traffic: netbox-exports git cloning perf issues - https://phabricator.wikimedia.org/T324334 (10Volans) Sorry for the trouble, that was me indeed, I've fixed the permissions and run the `sre.dns.netbox` cookbook successfully: ` sudo cookbook sre.dns.netbox --force 0b5a2... 
[19:47:21] (PS16) Hashar: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) [19:47:27] (PS6) Hashar: Boilerplate for QUnit testing [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/861486 [20:14:33] (PS1) RobH: Updating SKU list for 15th Generation servers [software] - https://gerrit.wikimedia.org/r/863390 [20:16:08] (CR) RobH: [C: +2] Updating SKU list for 15th Generation servers [software] - https://gerrit.wikimedia.org/r/863390 (owner: RobH) [20:33:12] Puppet, SRE, Infrastructure-Foundations, serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (Dzahn) >>! In T163667#8438920, @jnuche wrote: > `mwdeploy` has uid/gid 499 in prod hosts and 603 in beta and WMCS. Those literal values are not specified an... [20:36:00] Puppet, SRE, Infrastructure-Foundations, serviceops-collab, serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (Dzahn) [20:38:04] SRE, Gerrit, serviceops-collab, Release-Engineering-Team (Seen): Create Gerrit Administrator right policy - https://phabricator.wikimedia.org/T218686 (Dzahn) Probably needs input from Tim because he wrote the access policy after this came up in the past and there was a lot of discussion about it... 
[20:42:12] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:50:06] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:20:00] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [21:21:58] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:23:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:28:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:35] (03CR) 10RLazarus: wdqs: add 
grizzly dashboard for uptime (3 comments) [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper) [21:43:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:45:02] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:46:09] (PS1) Vlad.shapik: WP: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212) [21:46:59] (PS2) Vlad.shapik: WIP: Add ability to specify filters such as sharpening and etc. 
for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T47212) [21:48:53] (03CR) 10Herron: [C: 03+1] "LGTM pending follow up on rzl comments" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [22:09:14] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [22:11:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [23:20:46] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [23:22:04] (03PS1) 10BCornwall: varnish: Export runtime params for Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/863406 (https://phabricator.wikimedia.org/T323723) [23:22:48] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [23:23:30] (03PS2) 10BCornwall: varnish: Export runtime params for Prometheus [puppet] 
- https://gerrit.wikimedia.org/r/863406 (https://phabricator.wikimedia.org/T323723) [23:36:01] (CR) Dzahn: "well, the host names on cloud VPS instances look generally broken, as in -f does NOT return any FQDN:" [puppet] - https://gerrit.wikimedia.org/r/862908 (owner: Paladox) [23:36:43] (CR) Dzahn: "Was there a ticket for that change?" [puppet] - https://gerrit.wikimedia.org/r/862908 (owner: Paladox) [23:38:27] (CR) Dzahn: [C: -1] "I think if anything it's eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/862908 (owner: Paladox) [23:42:56] (CR) Dzahn: [C: +2] "thanks! yea, this should be added, I will afterwards remove it from Horizon and confirm noop" [puppet] - https://gerrit.wikimedia.org/r/862909 (owner: Paladox) [23:43:09] (PS3) Dzahn: phabricator-prod-1001: Set mysql master port and salve port [puppet] - https://gerrit.wikimedia.org/r/862909 (owner: Paladox) [23:43:26] (PS4) Dzahn: phabricator-prod-1001: Set mysql master port and slave port (cloud) [puppet] - https://gerrit.wikimedia.org/r/862909 (owner: Paladox)