[07:49:45] Amir1: can I start reboots in s6 in eqiad and s3 in codfw?
[08:17:26] I am going to reboot the non active proxies
[08:25:16] Morning all, could someone +1 https://gerrit.wikimedia.org/r/c/labs/private/+/1151605 please? Adding an apus account to labs/private
[08:58:41] (this has been done)
[09:12:50] federico3: I think this sort of broke the upgrade cookbook: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151219 because I used it (so I didn't add -t) and when the cookbook finished I got: https://phabricator.wikimedia.org/P76552
[09:15:04] ah I missed one of the task_comment entries, just a sec
[09:15:16] thanks
[09:16:52] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151620
[09:17:37] I can run the upgrade myself with test-cookbook -c 1151620, or you can do it on your side so that we do a real end-to-end test (not a dry run) before merging?
[09:17:54] Ok, I will do it in a bit
[09:18:37] let me open a cleanup task for better safety
[09:25:19] federico3: confirm: test-cookbook -c 1151620 sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet ?
[09:26:30] yes, but I would always do a dry run immediately before :)
[09:26:41] test-cookbook -c 1151620 --dry-run sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet
[09:27:28] Yeah, asking to confirm, because I am getting https://phabricator.wikimedia.org/P76554 so I wasn't sure what this was about
[09:32:39] uhm, no, that's an issue with the host not being found in the puppet query
[09:35:22] e.g. sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'es2035.codfw.wmnet' starts, while sudo cookbook --dry-run sre.mysql.upgrade -r "dry run test" 'db2187.codfw.wmnet' does not find the host
[09:35:49] it's a mariadb::sanitarium_multiinstance host
[09:36:56] Ah yes
[09:37:07] I had the same issue yesterday with db2186 and I forgot this is the same XD
[09:37:10] My bad, sorry
[09:55:44] federico3: go for it!
[10:03:18] federico3: Running the script to upgrade db2187, let's see how it goes!
[10:09:27] federico3: all good
[10:09:53] ok, thanks
[10:10:14] CR merged
[11:02:24] can I access prometheus.svc.eqiad.wmnet from gitlab CI?
[13:07:19] I am failing over the m1 master
[13:52:48] marostegui: https://phabricator.wikimedia.org/T384212#10862972 are you referring to creating one user or two? (one for show replica on all databases and another to write on the zarcillo db?)
[13:54:42] federico3: I am busy at the moment with the x3 split
[13:54:54] no worries
[14:01:23] We are going to set s8 (wikidata) as RO for a few minutes to split x3 from it
[14:41:57] zabe: hii, I killed your s8 migration script, since we set the db to read only and it was still writing, would you mind turning it on again when you have time?
[14:42:34] ye
[14:42:36] s
[14:42:46] We should actually fix that
[14:42:53] It is pretty dangerous
[14:43:12] It got me quite confused for a few minutes
[14:43:49] And we could've had a split brain
[14:44:18] will the x3 split reach wiki replicas today or will that happen a bit later?
[14:45:13] taavi: Actually, that is more complex than we think, we need to reimport all that into the sanitarium host and then into the wikireplicas, so I don't think it is happening today
[14:45:34] What's the size of the data we are talking about?
[14:46:36] Any of those tables need filtering? Or is it all public?
[16:18:26] es1035 memory alert is flapping now on -operations
[16:31:53] looking
[16:34:16] marostegui: shall we prioritize the restarts, e.g. tomorrow morning?
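For reference, the verify-then-run sequence discussed above, as a sketch built only from the commands quoted in this log (the change number 1151620, the sre.mysql.upgrade cookbook, the reason string and db2187.codfw.wmnet are the specific values used here, not a general recipe):

# dry run first, to confirm the patched cookbook resolves the host and posts the task comment correctly
test-cookbook -c 1151620 --dry-run sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet
# then the real run with the same arguments, once the dry run looks sane
test-cookbook -c 1151620 sre.mysql.upgrade -r "Upgrade" db2187.codfw.wmnet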
[16:35:00] i've never done the security updates on es* - any pointer from Amir1?
[16:36:48] would the restarts actually fix the problem or just postpone it?
[16:37:52] federico3: if it's a replica of a RW section (es6-es7), the script should just work (just set the section to es6 or es7); for RO sections it's a bit more complicated as there is no replication
[16:38:06] afaik we don't know, but at least we get out of the almost-emergency right now
[16:41:00] es1035 is a master of a RW section, you need to do some work, it's complex
[16:41:21] same goes for es2038
[16:41:51] you can't just use automation for them, you have to first stop writing to that section, then do a switchover in dbctl, then depool it and then you can restart it
[16:41:57] I'm going to be afk this evening (in 10 mins) :-/ so far it looks like we might be ok for a little bit more but i'd rather get the restart done soon
[16:42:58] would it make sense if you do the most urgent restarts and I follow/document the process?
[17:04:46] (if it can help I'll be back in 3-4 hours)
[17:28:15] federico3: note that I posted this earlier on the task: https://phabricator.wikimedia.org/T395294#10863529
[21:00:41] Amir1: are you around by any chance?
[21:06:25] FIRING: [5x] SystemdUnitFailed: check-private-data.service on db2186:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
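A rough outline of the master-restart sequence Amir1 describes at 16:41:51, written as shell steps. This is a sketch, not the documented procedure: the dbctl subcommands shown are the common set-master/depool/commit ones, the section name es6 and the replacement host es1042 are hypothetical placeholders, and the mechanism for stopping writes to an external-storage section is not covered in this log, so it stays a comment.

# 1. stop writes to the RW section first (done outside dbctl; not shown here)
# 2. promote another host to master for the section, then commit the config change
sudo dbctl --scope eqiad section es6 set-master es1042   # es1042 is a hypothetical replacement master
sudo dbctl config commit -m "Switch es6 eqiad master ahead of es1035 restart"
# 3. depool the old master and commit
sudo dbctl instance es1035 depool
sudo dbctl config commit -m "Depool es1035 for restart"
# 4. only then restart/reboot es1035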