[00:38:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/984244
[00:38:38] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/984244 (owner: TrainBranchBot)
[00:46:14] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:52] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/984244 (owner: TrainBranchBot)
[02:38:38] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:38] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:57:42] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:58:34] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:05:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.268 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:12:50] (PS1) Marostegui: pc2016: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/985084
[06:14:18] (CR) Marostegui: [C: +2] pc2016: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/985084 (owner: Marostegui)
[06:40:30] (PS1) Ilias Sarantopoulos: ml-services: limit threads for readability [deployment-charts] - https://gerrit.wikimedia.org/r/985088 (https://phabricator.wikimedia.org/T348664)
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231222T0700)
[07:23:21] (PS1) Alexandros Kosiaris: envoy: Make tracing configuration clearer [puppet] - https://gerrit.wikimedia.org/r/985102 (https://phabricator.wikimedia.org/T351566)
[07:26:42] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/985102 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[07:33:14] (CR) Alexandros Kosiaris: [C: +2] envoy: Make tracing configuration clearer [puppet] - https://gerrit.wikimedia.org/r/985102 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[07:45:42] (CR) Alexandros Kosiaris: Provide OpenTelemetry Collector and Port values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[07:47:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:56:23] (PS7) Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231222T0800)
[08:02:08] (PS2) Alexandros Kosiaris: Provide OpenTelemetry Collector and Port values [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566)
[08:02:23] (CR) CI reject: [V: -1] Provide OpenTelemetry Collector and Port values [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[08:08:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:08:28] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:27:52] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:32:41] (PS3) Alexandros Kosiaris: Provide OpenTelemetry Collector and Port values [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566)
[08:37:58] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:24] (CR) Alexandros Kosiaris: [C: +2] Provide OpenTelemetry Collector and Port values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[08:51:44] (CR) Alexandros Kosiaris: [C: +2] Provide OpenTelemetry Collector and Port values (1 comment) [puppet] - https://gerrit.wikimedia.org/r/984817 (https://phabricator.wikimedia.org/T351566) (owner: Alexandros Kosiaris)
[08:57:16] (CR) AikoChou: [C: +1] ml-services: limit threads for readability [deployment-charts] - https://gerrit.wikimedia.org/r/985088 (https://phabricator.wikimedia.org/T348664) (owner: Ilias Sarantopoulos)
[09:08:55] (PS3) Alexandros Kosiaris: Switch canaries to 1% OpenTelemetry sampling [puppet] - https://gerrit.wikimedia.org/r/984814 (https://phabricator.wikimedia.org/T351566)
[09:30:22] (PS1) Ayounsi: Disable homer daily diff on cumin1001 [puppet] - https://gerrit.wikimedia.org/r/985112
[09:30:42] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/985112 (owner: Ayounsi)
[09:33:37] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/985112 (owner: Ayounsi)
[09:34:17] (CR) Ayounsi: [C: +2] Disable homer daily diff on cumin1001 [puppet] - https://gerrit.wikimedia.org/r/985112 (owner: Ayounsi)
[09:40:56] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-web_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:55:00] (CR) Ilias Sarantopoulos: [C: +2] ml-services: limit threads for readability [deployment-charts] - https://gerrit.wikimedia.org/r/985088 (https://phabricator.wikimedia.org/T348664) (owner: Ilias Sarantopoulos)
[09:55:56] (Merged) jenkins-bot: ml-services: limit threads for readability [deployment-charts] - https://gerrit.wikimedia.org/r/985088 (https://phabricator.wikimedia.org/T348664) (owner: Ilias Sarantopoulos)
[09:57:33] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[10:01:17] (PS1) Ayounsi: Validators: enforce Trident3 port block consistency [software/netbox-extras] - https://gerrit.wikimedia.org/r/985113 (https://phabricator.wikimedia.org/T303529)
[10:15:04] (PS1) Ilias Sarantopoulos: ml-services: log payload in revertrisk [deployment-charts] - https://gerrit.wikimedia.org/r/985114 (https://phabricator.wikimedia.org/T352958)
[10:24:11] (PS3) Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706)
[10:24:43] (CR) Lucas Werkmeister (WMDE): "PS3 is a new debug approach, though it’ll have to rebased before we try it out next year." [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706) (owner: Lucas Werkmeister (WMDE))
[10:29:59] (CR) AikoChou: [C: +1] ml-services: log payload in revertrisk [deployment-charts] - https://gerrit.wikimedia.org/r/985114 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[10:34:27] SRE, Ganeti, Infrastructure-Foundations, netops: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi)
[10:36:12] (CR) Ilias Sarantopoulos: [C: +2] ml-services: log payload in revertrisk [deployment-charts] - https://gerrit.wikimedia.org/r/985114 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[10:37:01] (CR) CI reject: [V: -1] Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706) (owner: Lucas Werkmeister (WMDE))
[10:37:36] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:46] (Merged) jenkins-bot: ml-services: log payload in revertrisk [deployment-charts] - https://gerrit.wikimedia.org/r/985114 (https://phabricator.wikimedia.org/T352958) (owner: Ilias Sarantopoulos)
[10:38:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:39:26] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:42:07] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:42:23] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:43:11] (PS4) Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706)
[10:47:07] hi folks! https://phabricator.wikimedia.org/T353920 has been qualified as UBN; we think we have a fix (to be confirmed/+2'd), but we're wondering if it's an "emergency deployment" situation or not, can someone guide us here?
[10:47:56] (probably needing to ping thcipriani and brennen as this week's conductor)
[10:52:44] patch has been +2'd and is being merged https://gerrit.wikimedia.org/r/c/mediawiki/core/+/985120
[10:54:39] (CR) Filippo Giunchedi: [C: +1] udp2log: amend demux.py to support the python3 runtime [puppet] - https://gerrit.wikimedia.org/r/984237 (https://phabricator.wikimedia.org/T353220) (owner: Cwhite)
[11:04:05] (CR) Filippo Giunchedi: udp2log: add simple benthos pipeline (2 comments) [puppet] - https://gerrit.wikimedia.org/r/984238 (https://phabricator.wikimedia.org/T353220) (owner: Cwhite)
[11:18:06] (PS1) Filippo Giunchedi: pontoon: copy post-receive hook from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/985125 (https://phabricator.wikimedia.org/T352640)
[11:18:35] (CR) CI reject: [V: -1] pontoon: copy post-receive hook from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/985125 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[11:22:04] (PS1) Filippo Giunchedi: dummy [puppet] - https://gerrit.wikimedia.org/r/985146
[11:22:38] of course a mistake
[11:23:44] (PS2) Filippo Giunchedi: pontoon: copy post-receive hook from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/985125 (https://phabricator.wikimedia.org/T352640)
[11:23:49] (Abandoned) Filippo Giunchedi: dummy [puppet] - https://gerrit.wikimedia.org/r/985146 (owner: Filippo Giunchedi)
[11:24:43] (CR) Filippo Giunchedi: [C: +2] pontoon: copy post-receive hook from puppetmaster [puppet] - https://gerrit.wikimedia.org/r/985125 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[11:31:38] !log upload golang-github-intel-go-cpuid_0.0~git20210602.5747e5c-2+deb12u1 to apt.wm.o (bookworm)
[11:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:12] SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (C.Suthorn) >>! In T266155#8710759, @TheDJ wrote: >>>! In T266155#8707579, @doctaxon wrot...
[12:08:16] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:08:24] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:12:44] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:12:52] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:40:47] (PS1) Krinkle: Skin: Restore autoloading of mediawiki.ui.button styles [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985032 (https://phabricator.wikimedia.org/T346469)
[12:51:30] (PS1) Alexandros Kosiaris: RESTBase: Hardcode no_worker patch to stop OOM [puppet] - https://gerrit.wikimedia.org/r/985154 (https://phabricator.wikimedia.org/T353456)
[12:52:01] (CR) Alexandros Kosiaris: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/985154 (https://phabricator.wikimedia.org/T353456) (owner: Alexandros Kosiaris)
[12:55:31] (CR) MVernon: [C: +1] "Looks sensible to me, thanks." [puppet] - https://gerrit.wikimedia.org/r/985154 (https://phabricator.wikimedia.org/T353456) (owner: Alexandros Kosiaris)
[12:56:39] (CR) Alexandros Kosiaris: [C: +2] "https://puppet-compiler.wmflabs.org/output/985154/2913/restbase2021.codfw.wmnet/index.html says that what we expect will happen, so mergin" [puppet] - https://gerrit.wikimedia.org/r/985154 (https://phabricator.wikimedia.org/T353456) (owner: Alexandros Kosiaris)
[12:56:44] (CR) Alexandros Kosiaris: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/985154 (https://phabricator.wikimedia.org/T353456) (owner: Alexandros Kosiaris)
[13:09:20] (PS11) Majavah: keystone haproxy: increase server timeout for admin service to 10m [puppet] - https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: Andrew Bogott)
[13:09:32] (CR) Majavah: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: Andrew Bogott)
[13:11:19] (CR) Majavah: [C: +2] keystone haproxy: increase server timeout for admin service to 10m [puppet] - https://gerrit.wikimedia.org/r/984640 (https://phabricator.wikimedia.org/T353829) (owner: Andrew Bogott)
[13:11:48] (PS1) Reedy: Revert "Use Remex for DeduplicateStyles transform" [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985033 (https://phabricator.wikimedia.org/T353920)
[13:13:28] (CR) Reedy: [C: +2] Revert "Use Remex for DeduplicateStyles transform" [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985033 (https://phabricator.wikimedia.org/T353920) (owner: Reedy)
[13:14:56] (PS5) Majavah: Galera haproxy: fix handling of primary_host [puppet] - https://gerrit.wikimedia.org/r/984866 (owner: Andrew Bogott)
[13:16:08] (CR) Majavah: [C: +2] Galera haproxy: fix handling of primary_host [puppet] - https://gerrit.wikimedia.org/r/984866 (owner: Andrew Bogott)
[13:33:08] (Merged) jenkins-bot: Revert "Use Remex for DeduplicateStyles transform" [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985033 (https://phabricator.wikimedia.org/T353920) (owner: Reedy)
[13:37:11] !log reedy@deploy2002 Started scap: T353920
[13:37:29] T353920: HTML entity in wikitext are being over parsed - https://phabricator.wikimedia.org/T353920
[13:44:58] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@4f56fff]: (no justification provided)
[13:45:14] !log reedy@deploy2002 Finished scap: T353920 (duration: 08m 02s)
[13:45:25] T353920: HTML entity in wikitext are being over parsed - https://phabricator.wikimedia.org/T353920
[14:01:29] (PS6) Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799
[14:01:55] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@4f56fff]: (no justification provided) (duration: 16m 57s)
[14:03:19] (CR) CI reject: [V: -1] Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede)
[14:05:21] (PS2) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300
[14:06:19] (PS3) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300
[14:07:52] (PS7) Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799
[14:08:36] SRE, Commons, MediaWiki-File-management, StructuredDataOnCommons, and 4 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (TheDJ) >>! In T266155#9423263, @C.Suthorn wrote: > It would make a much better UX, if in...
[14:08:44] (CR) CI reject: [V: -1] Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede)
[14:09:40] (CR) CI reject: [V: -1] Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799 (owner: Slyngshede)
[14:10:42] (PS4) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300
[14:11:24] (PS5) Slyngshede: Package Debmonitor server as .deb [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300
[14:12:10] SRE, ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (Jclark-ctr) @MoritzMuehlenhoff replacement drive was sent by dell. Can this be replaced at any time?
[14:13:11] (PS8) Slyngshede: Changes to Python infrastucture to help building Debian package. [software/debmonitor] - https://gerrit.wikimedia.org/r/982799
[14:14:51] (PS1) Ladsgroup: Add virtual domain config for reading lists extension [mediawiki-config] - https://gerrit.wikimedia.org/r/985160 (https://phabricator.wikimedia.org/T353948)
[14:15:56] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) @Andrew Dell would like to replace cpu and reapply thermal paste they would like to preform service today is server still down?
[14:23:01] (CR) Slyngshede: Package Debmonitor server as .deb (1 comment) [software/debmonitor] (debian) - https://gerrit.wikimedia.org/r/981300 (owner: Slyngshede)
[14:28:26] SRE, ops-eqiad: SMART errors on ganeti1031 - https://phabricator.wikimedia.org/T353324 (MoritzMuehlenhoff) >>! In T353324#9423490, @Jclark-ctr wrote: > @MoritzMuehlenhoff replacement drive was sent by dell. Can this be replaced at any time? In principle, yes. But let's do it in January, just ping me be...
[14:36:12] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:36:58] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:22] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@f0c9f9f]: (no justification provided)
[14:53:39] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:57:54] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@f0c9f9f]: (no justification provided) (duration: 09m 32s)
[14:58:49] !log jgiannelos@deploy2002 Started deploy [restbase/deploy@5f2756a]: (no justification provided)
[15:12:12] RECOVERY - Host mw2448 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms
[15:13:01] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (herron) p:Triage→Medium
[15:14:19] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (herron)
[15:16:16] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Jhancock.wm) Open→Resolved a:Jhancock.wm Got the part replaced. The bios settings got wiped when the CMOS battery got swapped out. I believe everything is back to how it should be. the idra...
[15:16:26] !log jgiannelos@deploy2002 Finished deploy [restbase/deploy@5f2756a]: (no justification provided) (duration: 17m 36s)
[15:18:21] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (taavi) >>! In T353408#9423493, @Jclark-ctr wrote: > @Andrew Dell would like to replace cpu and reapply thermal paste they would like to preform service t...
[15:38:01] thanks ihurbain and reedy for the speedy fix right before the holidays <3
[15:38:59] happy it was possible :)
[15:40:37] (CR) JMeybohm: "Nice! I would add some .fixtures files to have more CI coverage. Also I'm not sure what to do if there are no endpoint IPs. Maybe we shoul" [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[15:40:51] I’d like to test something on mwdebug (temporarily deploy a Gerrit change there, then discard it again)
[15:40:59] is that okay on a no-deploy Friday? WDYT?
[15:47:53] (PS5) Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706)
[15:50:48] (PS6) Lucas Werkmeister (WMDE): Add debug code for entity usage logic issue [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706)
[15:51:20] (CR) Lucas Werkmeister (WMDE): "PS5 is another new debug approach (PS3/4 are obsolete) – see T255706#9423657. (We’ll still have to rebase this on the new wmf branch next " [extensions/Wikibase] (wmf/1.42.0-wmf.9) - https://gerrit.wikimedia.org/r/984848 (https://phabricator.wikimedia.org/T255706) (owner: Lucas Werkmeister (WMDE))
[15:57:49] Lucas_WMDE: seems harmless to me and I know you've got the power, sure
[15:58:57] scap pull should still work fine there
[16:05:45] lmata: I think b.black is on vacation...
[16:06:25] fabfur: they have been all week (hence my mail to sre@ on Monday asking if anyone from Americas TZ could cover)
[16:08:57] thcipriani: thanks! but I think I’ll wait for the new year after all
[16:09:14] (I was hoping to investigate this with codders [new WMDE colleague] and MichaelG_WMDE but codders just left)
[16:10:01] Emperor: tnx
[16:10:02] Lucas_WMDE: gotcha, that sounds good, too :)
[16:10:44] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/983896 (https://phabricator.wikimedia.org/T353363) (owner: Herron)
[16:12:10] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) (owner: Herron)
[16:33:42] (CR) Krinkle: [C: +2] Skin: Restore autoloading of mediawiki.ui.button styles [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985032 (https://phabricator.wikimedia.org/T346469) (owner: Krinkle)
[16:45:34] (PS1) Ahmon Dancy: Revert "logspam: Consolidate Actor name can not be empty for 0 and..." [puppet] - https://gerrit.wikimedia.org/r/985034
[16:50:10] (CR) Krinkle: [C: +1] Revert "logspam: Consolidate Actor name can not be empty for 0 and..." [puppet] - https://gerrit.wikimedia.org/r/985034 (owner: Ahmon Dancy)
[16:53:23] (Merged) jenkins-bot: Skin: Restore autoloading of mediawiki.ui.button styles [core] (wmf/1.42.0-wmf.10) - https://gerrit.wikimedia.org/r/985032 (https://phabricator.wikimedia.org/T346469) (owner: Krinkle)
[17:25:55] (CR) Herron: [C: +2] admin: add rkhan to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/983753 (https://phabricator.wikimedia.org/T353370) (owner: Herron)
[17:28:33] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (AnnWF)
[17:28:46] !log krinkle@deploy2002 Synchronized php-1.42.0-wmf.10/includes/skins/Skin.php: Ice6d6cfb442e4b3edb7f3002799780c76d7e0b0e (duration: 06m 25s)
[17:29:47] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (taavi)
[17:30:59] (PS2) Herron: admin: add anakanishi to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/983896 (https://phabricator.wikimedia.org/T353363)
[17:32:44] (CR) Herron: [C: +2] admin: add anakanishi to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/983896 (https://phabricator.wikimedia.org/T353363) (owner: Herron)
[17:38:16] SRE, SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Riddy Khan - https://phabricator.wikimedia.org/T353370 (herron) Open→Resolved a:herron The requested access has been granted and will be live within the next 30 minutes. Transition...
[17:40:45] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Aki Nakanishi - https://phabricator.wikimedia.org/T353363 (herron) Open→Resolved a:herron The requested access has been granted and will be live within the nex...
[17:42:35] (PS2) Herron: admin: add damilare to 'deployment' [puppet] - https://gerrit.wikimedia.org/r/984863 (https://phabricator.wikimedia.org/T353838)
[17:43:39] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Damilare Adedoyin - https://phabricator.wikimedia.org/T353838 (herron)
[17:47:09] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:47:33] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (herron)
[17:54:28] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (herron) Hi @AnnWF could you please confirm that the developer account name is correct? The account named wfan is not associated with the email address in the description. Thanks in adva...
[17:57:09] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[17:57:59] (CR) Dzahn: [C: +2] Revert "logspam: Consolidate Actor name can not be empty for 0 and..." [puppet] - https://gerrit.wikimedia.org/r/985034 (owner: Ahmon Dancy)
[18:02:22] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Volans) @Jhancock.wm the easiest and safest way to reconfigure a BIOS is to run the `sre.hosts.provision` cookbook like it was a new host just with some options to skip unnecessary steps like `--no-dhc...
[18:07:24] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (AnnWF) >>! In T353958#9423872, @herron wrote: > Hi @AnnWF could you please confirm that the developer account name is correct? The account named wfan is not associated with the email add...
[18:07:33] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (AnnWF)
[18:13:01] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (herron) >>! In T353958#9423892, @AnnWF wrote: >>>! In T353958#9423872, @herron wrote: >> Hi @AnnWF could you please confirm that the developer account name is correct? The account named...
[18:23:33] SRE, SRE-Access-Requests: Grant Access to analytics-privatedata-users for mcastro-wmf - https://phabricator.wikimedia.org/T353273 (herron) Stalled→Invalid Closing as invalid due to inactivity, please reopen with the requested updates when ready to proceed. Thanks!
[18:25:45] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (AnnWF)
[18:25:59] SRE, SRE-Access-Requests: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (AnnWF) >>! In T353958#9423898, @herron wrote: >>>! In T353958#9423892, @AnnWF wrote: >>>>! In T353958#9423872, @herron wrote: >>> Hi @AnnWF could you please confirm that the developer acc...
[18:29:18] (PS1) Herron: admin: add wfan219 to deployment [puppet] - https://gerrit.wikimedia.org/r/985331 (https://phabricator.wikimedia.org/T353958) [18:33:35] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (herron) Great thanks, I've just uploaded the patch. Next step will be a few approvals: @XenoRyet could you please review/approve as manager? @thcipriani could you... [18:46:33] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/984863 (https://phabricator.wikimedia.org/T353838) (owner: Herron) [18:51:29] (PS10) Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [18:51:31] (CR) Brouberol: "Thanks for the review! I've addressed most of your suggestions/feedbacks. The remaining one is whether dual stack would work as-is." [deployment-charts] - https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol) [19:46:23] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for wenjun fan - https://phabricator.wikimedia.org/T353958 (thcipriani) Approved [19:54:51] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (Jhancock.wm) thank you. I'll remember that one. [20:15:37] (CR) Herron: [C: +2] admin: add damilare to 'deployment' [puppet] - https://gerrit.wikimedia.org/r/984863 (https://phabricator.wikimedia.org/T353838) (owner: Herron) [20:21:51] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for Damilare Adedoyin - https://phabricator.wikimedia.org/T353838 (herron) Open→Resolved a:herron The requested access has been granted and will be live within the next 30 minutes. Transitioning to resolved, but p... 
[20:49:12] SRE, ops-codfw, serviceops: mw2448.codfw.wmnet is down - https://phabricator.wikimedia.org/T353679 (sbassett) Any objections to making this public now? [21:10:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/302: 2.793473063197654s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [21:15:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver POST/302: 2.4594605679135544s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [21:18:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/302: 2.6719965265986083s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede [21:23:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver POST/302: 2.3200974075776153s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee [22:34:16] 
(MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver POST/302: 5.623385411120031s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver POST/302: 4.6725273144158725s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee