[01:16:26] FIRING: [13x] SystemdUnitFailed: apt-daily-upgrade.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:26] FIRING: [13x] SystemdUnitFailed: apt-daily-upgrade.service on ms-be2075:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:51:26] FIRING: [14x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:51:26] FIRING: [14x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:38:20] <_joe_> personal opinion: I see these alerts about prometheus-mysqld-exporter not restarting regularly. Is this auto restart really useful/valuable?
[08:39:03] _joe_: The es one is because the host is being set up
[08:39:05] I just finished it
[08:39:08] so it will recover shortly
[08:46:26] FIRING: [14x] SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es1045:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:50:43] Changing es5 master now
[09:43:12] Going to depool pc5 and work there
[11:30:05] PROBLEM - MariaDB sustained replica lag on s3 on db2149 is CRITICAL: 382 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2149&var-port=9104
[11:36:05] RECOVERY - MariaDB sustained replica lag on s3 on db2149 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2149&var-port=9104
[11:38:14] let me look into this
[11:38:25] No, that was part of an upgrade
[11:38:43] ah I just saw the operations
[11:44:41] I think I have to stop asking if "expected"
[11:45:09] because p*ges are never expected - better to ask if people are "surprised" or "should I worry"
[11:46:06] semi-related, the alertmanager spam is starting to get a bit much
[11:46:23] and sadly, because of those workflow issues I mentioned, there is not much we can do atm
[11:46:43] as notif_disabled doesn't seem to work well for prometheus
[11:47:28] I wonder if that could be implemented better using tags or something
[11:48:10] I was very happy when Alex came up with notif_disabled for icinga, I wish we could have the same for prometheus
[11:48:30] (or equivalent behaviour)
[11:54:56] Yeah, the disable-notifications option in puppet is great
[11:55:03] Especially when provisioning hosts
[11:56:46] <_joe_> it would mean we need to reengineer how we provision configuration to alertmanager, I guess
[11:57:01] <_joe_> is there a task related to this?
[11:57:27] From us, not that I know of
[12:04:02] some alerts are ignored with this mechanism:
[12:04:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/alertmanager/templates/alertmanager.yml.erb#38
[12:09:55] We do have a task somewhere relating to getting too many noisy alerts
[12:10:25] T357333
[12:10:26] T357333: SystemdUnitFailed alerts are too noisy for data-persistence - https://phabricator.wikimedia.org/T357333
[12:12:18] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:21:56] I remember there were some recommendations in the industry to avoid making fields nullable if you can. IIRC it had something to do with indexing, but my memory is fuzzy - does anyone remember something like this?
[12:22:38] search is useless, probably full of AI slop
[12:28:37] found some stuff
[12:30:14] Going to depool pc4
[12:32:54] wohoo
[12:34:23] Amir1: please note that business logic >>> performance in all cases
[12:34:47] definitely
[12:35:39] also, I seem to remember that InnoDB@5.6(?) fixed the most egregious cases
[12:36:04] I think nulls used to require a full byte or a full word or something like that
[12:37:08] now it is a single bit
[12:41:07] that's quite useful to know. Thanks!
[12:43:49] I think there are cases where nulls get ignored, which helps with low-cardinality indexes
[12:44:38] but on the other hand, it may take more storage. It is one of those things where logical correctness is probably way more important, unless you hit a perf issue
[12:45:29] e.g. if most of the values are 0, with a few 1, 2, 3, the index may be ignored
[12:45:40] while null, 1, 2, 3 may use an index
[12:46:04] but honestly, that is one of the things that has to be evaluated case by case
[12:49:18] "Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column."
[12:49:28] quoting the manual: https://dev.mysql.com/doc/refman/8.4/en/data-size.html
[13:02:13] Amir1: https://orchestrator.wikimedia.org/web/cluster/alias/pc6 getting there
[13:02:22] wohoooo
[13:03:25] ah thanks jynus, that was what I'd read too
[13:59:39] Amir1: do you want me to add pc6 pooled or depooled?
[13:59:54] I guess it is fine to pool it, as nothing will write there until you tell MW?
[13:59:56] let's pool it for now?
[14:00:08] sure
[14:00:21] MW will start writing immediately
[14:00:23] Amir1: MW will start automatically?
[14:00:24] ah nice
[14:00:40] I will send an email about it once it is confirmed
[14:01:10] but don't pool pc7 yet since too many keys will be displaced (I think we will still be fine but latency will jump way too much)
[14:01:17] no, pc7 isn't ready
[14:05:47] Amir1: pc6 is now live
[14:05:52] \o/
[14:06:08] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=pc6&var-role=All&from=now-1h&to=now
[14:07:24] grafana is already happy about pc7, so it's forward-looking https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All
[14:07:52] pc7 isn't live at all
[14:07:56] it is not even configured
[14:08:29] grafana can predict the future
[14:08:31] those servers may be getting replication from others, but they are not live
[14:09:39] yeah
[14:15:22] the latency is a bit higher than the baseline but not by a wide margin: https://grafana.wikimedia.org/d/35WSHOjVk/application-servers-red-k8s?orgId=1&refresh=1m&viewPanel=9&from=now-12h&to=now
[14:15:44] These hosts are totally cold
[14:15:49] So not very surprising
[14:16:10] yeah, that's expected
[14:35:16] Random small annoyance: https://noc.wikimedia.org/dbconfig/eqiad.json should order the sections so I can check all PC or ES hosts easily
[14:35:30] I'll fix it, it's too annoying
[14:38:55] yeah thanks
[15:44:11] marostegui: I'm deleting rows from user_properties in enwiki. It's much smaller in scale than the previous ones that caused replag, but it might cause tiny replags from time to time
[15:44:35] Amir1: got it thanks
[15:44:44] there isn't much we can do :(
[16:00:31] Emperor: created https://phabricator.wikimedia.org/T383903, lemme know if it is good
[16:10:52] marostegui: https://noc.wikimedia.org/dbconfig/eqiad.json (empty browser cache)
[16:19:01] elukey: LGTM, thanks
[17:56:11] PROBLEM - MariaDB sustained replica lag on s1 on db1163 is CRITICAL: 20.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1163&var-port=9104
[17:58:35] concerning
[17:58:44] I wonder if that is dumps
[18:02:27] I see a large amount of disk writes, but not why
[18:08:11] RECOVERY - MariaDB sustained replica lag on s1 on db1163 is OK: (C)10 ge (W)5 ge 4.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1163&var-port=9104
[18:10:38] for tomorrow: there was a traffic-caused high load on s1 (that part is not a worry), but db1163 seemed to struggle more than the others to keep replication up to date. Maybe it needs a deeper hw or status review to see why it was performing so much worse than the others
[18:11:08] e.g. maybe a disk is close to failure, or it needs a restart or something
[18:11:24] didn't see anything obvious in logs or metrics
[18:13:27] https://grafana.wikimedia.org/goto/vnR5jlDNR?orgId=1
[18:16:58] jynus: I mentioned it a bit earlier, that's mine
[18:17:10] it'll be done soon, there is no way around this
[18:17:28] oh, sorry, I didn't see it
[18:17:43] I didn't see maintenance happening at the time, so I thought it was replication
[18:18:13] yeah, I should have put it in the maint calendar
[18:18:50] actually, the fact that it happened on db1163 makes it interesting, even if it was intended
[18:19:02] lower hw spec or something?
[18:30:22] 715K left to be deleted
[18:30:28] one 1K batch every five seconds
[18:30:52] so far the main delete batch doesn't take that long
[21:55:27] PROBLEM - MariaDB sustained replica lag on s1 on db1235 is CRITICAL: 14.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[21:55:27] PROBLEM - MariaDB sustained replica lag on s1 on db1196 is CRITICAL: 11.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[21:55:29] PROBLEM - MariaDB sustained replica lag on s1 on db1232 is CRITICAL: 23.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[21:55:29] PROBLEM - MariaDB sustained replica lag on s1 on db1218 is CRITICAL: 10.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[21:56:27] RECOVERY - MariaDB sustained replica lag on s1 on db1235 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[21:56:27] RECOVERY - MariaDB sustained replica lag on s1 on db1196 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[21:56:29] RECOVERY - MariaDB sustained replica lag on s1 on db1218 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[21:57:29] RECOVERY - MariaDB sustained replica lag on s1 on db1232 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1232&var-port=9104
[23:21:05] the enwiki run is finally over, so no further errors should show up
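
Editor's note on the nullable-column discussion above (12:21 - 12:49): a minimal SQL sketch of the trade-off being described. The table and column names are hypothetical, purely for illustration; as the log itself says, whether the index actually gets used has to be evaluated case by case.

    -- Hypothetical tables, for illustration only.
    CREATE TABLE widget_notnull (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        flag TINYINT UNSIGNED NOT NULL DEFAULT 0,   -- NOT NULL: no per-row NULL bit, no NULL checks at query time
        KEY idx_flag (flag)
    );

    CREATE TABLE widget_nullable (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        flag TINYINT UNSIGNED NULL,                 -- nullable: costs roughly one extra bit per row in InnoDB
        KEY idx_flag (flag)
    );

    -- If almost every row has flag = 0 and only a few rows have 1, 2 or 3, the optimizer
    -- may skip idx_flag for "WHERE flag = 0" because that value is not selective.
    -- Storing "no value" as NULL instead keeps the indexed non-NULL values selective,
    -- so a query like the one below is more likely to use the index.
    EXPLAIN SELECT id FROM widget_nullable WHERE flag IN (1, 2, 3);

This is consistent with the quoted manual text: NOT NULL remains the safer default, and the per-row cost of NULL is small on modern InnoDB, so logical correctness should usually win unless a real performance problem shows up.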
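
Editor's note on the user_properties cleanup (15:44 and 18:30): the pattern described is deleting in small batches with a pause between them so replicas can keep up. The sketch below only illustrates that pattern, with a hypothetical WHERE condition and batch size; it is not the actual maintenance script (a careful version would also check replication lag before sending the next batch).

    -- Hypothetical condition and batch size, for illustration only.
    -- Run repeatedly until the DELETE affects 0 rows; the pause gives replicas time to catch up.
    DELETE FROM user_properties
        WHERE up_property = 'some-deprecated-property'
        LIMIT 1000;
    DO SLEEP(5);

Throttling of this kind is presumably why the 21:55 lag alerts above recover within a minute or two rather than growing unbounded.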