[01:31:33] 10DBA, 10Operations, 10Privacy Engineering, 10Traffic, and 4 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10Krinkle) [07:27:09] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Thanks @leila! @akosiaris does Tuesday 17th at 09:00 AM UTC work? [07:28:05] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10akosiaris) >>! In T246098#5955323, @Marostegui wrote: > Thanks @leila! > @akosiaris does Tuesday 17th at 09:00 AM UTC work? Fine by me. [07:36:14] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Excellent - going to send calendar invite and block that time on the deployment page. [07:37:01] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) [08:29:04] backup1001 will warn about a couple of empty backups [08:29:38] will run them manually when logical backups finish [08:30:19] Cool [08:57:51] should we announce the maintenance? [08:57:59] yeah [08:58:03] I will do in a sec [08:58:06] I was getting my env ready [08:58:46] there was a fetchblob error just now [08:59:00] it was me [08:59:03] ok [08:59:12] testing the non existing es5 for now [09:31:36] some monolog fatals on mwdebug1001 [09:31:52] mmm related? [09:32:05] probably not, probably deployment logging related [09:32:29] it is mwdebug only anyway [09:32:59] higher offeder is set-time-limit, but same rate as before deploy [09:33:04] *offender [09:33:30] yeah [09:33:31] so unless there is something weird on recentchanges, I would call it done [09:33:38] Yep [09:33:39] Going to close it [09:34:00] let's evaluate in a few hours if we want to go ahead today and set es3 as RO [09:34:27] let me update zarcillo's roles [09:34:46] thanks [09:34:47] or at least review them [09:35:30] yep, needs to add es5 masters [09:42:20] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10Goal, 10Patch-For-Review: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 (10Marostegui) es5 has been enabled correctly. We will either disable es3 and set it... [09:42:51] 10DBA, 10Core Platform Team Workboards (Clinic Duty Team), 10Goal, 10Patch-For-Review: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only - https://phabricator.wikimedia.org/T246072 (10Marostegui) [09:45:43] one last thing before break [09:45:57] I see 3 hosts on collection failures [09:46:09] db1114 (mysql 8, known) [09:46:24] db1111 and db1078, those expected? [09:47:19] db1111 is kinda expected as it is the most loaded on s8 [09:47:36] not about load [09:47:45] oh sorry, collection [09:47:48] prometheus failures- however,I can see their graphs [09:47:48] I read connection [09:47:51] sorry [09:47:56] prometheus collection [09:48:03] those run 10.4 [09:48:04] so not sure what that means? [09:48:05] so yes [09:48:11] expected as some of them aren't available on 10.4 [09:48:12] but other have 10.4 [09:48:25] and don't error out? 
[09:48:26] https://phabricator.wikimedia.org/T244696 [09:48:48] let me show the panel so you can tell [09:49:08] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=1583812144231&to=1583833744231&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [09:49:33] that's strange yeah [09:49:39] I would expect the other 10.4 hosts to also show up there [09:49:44] I thought there were other 10.4 hosts [09:49:49] ah, so it was not only me [09:49:49] yeah, there are lots of others [09:49:52] almost 1 per section [09:49:57] so why only those? [09:50:02] (rhetorical question) [09:50:21] just pointing out and worth researching- could be a prometheus only issue [09:50:41] (not worried about metrics not being available) [09:50:56] were those the last ones installed? [09:51:05] or maybe the only ones that are not multiinstance? [09:51:22] no, those were actually among the first ones [09:51:31] interesting [09:51:41] we have https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-2d&to=now&edit&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [09:51:43] will take a break and may poke them a bit [09:51:44] in codfw we have more [09:51:51] so something with 10.4 [09:51:54] just to be clear- prometheus seems to be working there normally [09:51:54] let's see what I can find [09:52:12] see the mysql stats working normally [09:52:24] yeah [09:53:01] vs db1114 which I didn't set up properly and doesn't even appear [09:55:03] BTW, db1111 appears on the test group [09:55:05] so all of them having errors could be related to the missing metrics indeed [09:55:08] needs to be moved to core [09:55:09] no? [09:55:30] but aren't there missing metrics on the other 10.4 hosts? [09:55:39] yes [09:55:45] but all codfw show errors [09:55:52] but they do not show as "collection errors"? [09:56:11] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-2d&to=now&edit&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [09:56:11] they do [09:56:34] ah, I see [09:56:42] so it only shows the first ones! [09:56:42] for some reason, the multi instance ones on eqiad aren't showing up there [09:57:00] ok, so no worry for now, right? [09:57:03] root@db1111:~# curl localhost:9104/metrics | grep scrape | grep -v "#" [09:57:04] % Total % Received % Xferd Average Speed Time Time Time Current [09:57:04] Dload Upload Total Spent Left Speed [09:57:04] 100 189k 100 189k 0 0 11.5M 0 --:--:-- --:--:-- --:--:-- 11.5M [09:57:04] mysql_exporter_last_scrape_error 1 [09:57:08] it doesn't change [09:57:10] yep [09:57:10] from what I can see [09:57:19] we will research that at a later time [09:57:27] yeah [09:57:36] I thought there was something else weird [09:59:26] I ran "update instances set `group` = 'core' where name = 'db1111' limit 1;" [09:59:40] db1112 still on test-s4, for now? [10:00:02] and I promise to have a better edit interface soon(TM) [10:00:44] no, db1112 is in s3 now [10:00:49] ah [10:01:04] ok, what about db1077?
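The manual check above (curl against the exporter's /metrics endpoint and grepping for mysql_exporter_last_scrape_error) can be repeated across several hosts at once. Here is a minimal Python sketch of that loop; the port 9104, the metric name, and the two hostnames come from the conversation, while the host list itself and the helper name are only illustrative:

```python
#!/usr/bin/env python3
"""Report mysql_exporter_last_scrape_error for a list of hosts.

Sketch of the manual `curl localhost:9104/metrics | grep scrape` check
discussed above; the host list is just an example, not a source of truth.
"""
import urllib.request

HOSTS = ["db1111.eqiad.wmnet", "db1078.eqiad.wmnet"]  # hosts mentioned in the log
PORT = 9104  # prometheus-mysqld-exporter listening port, as used above

def last_scrape_error(host: str) -> float:
    """Return mysql_exporter_last_scrape_error (1 means the scrape is failing)."""
    url = f"http://{host}:{PORT}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("mysql_exporter_last_scrape_error"):
                return float(line.split()[-1])
    raise RuntimeError(f"metric not exposed by {host}")

if __name__ == "__main__":
    for host in HOSTS:
        try:
            print(f"{host}: {last_scrape_error(host)}")
        except OSError as exc:
            print(f"{host}: unreachable ({exc})")
```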
[10:01:17] test-s4 yeah [10:01:31] db1077 got a broken BBU and we replaced it with db1112 [10:01:35] will update db1112 [10:01:37] I am guilty of not updating zarcillo [10:01:38] sorry :( [10:01:46] not your fault, it is not easy precisely [10:04:39] let me know if you see something else weird: https://phabricator.wikimedia.org/P10674 [10:05:11] s10 may be missing, I guess [10:05:57] yeah, s10 is db1133 [10:06:00] which is still m5 [10:06:30] ok, not sure how to represent something being both s10 AND m5 :-D [10:06:42] nah, I think just leave it in m5 [10:06:45] ok [10:08:14] will run snapshot bacula transfers and will take a break [10:08:48] cool! [10:08:54] I am checking the exporter thing [10:15:57] db1111 started going green? [10:16:50] I have restarted the exporter to debug [10:17:01] I think I have found something [10:17:02] let me see [10:17:31] db1078 is red still, but db1111 started to "go green" [10:19:08] I think it is the same case as the bug you saw with the password string [10:19:18] I have created a /root/.my.cnf and the errors started to go away [10:19:26] from what I can see on the --debug [10:19:37] let me confirm [10:19:44] created? but it should already be there, shouldn't it? [10:20:06] there was no .my.cnf on /root [10:20:21] I am also seeing some errors with these lines: https://github.com/prometheus/mysqld_exporter/blob/master/mysqld_exporter.go#L142 and https://github.com/prometheus/mysqld_exporter/blob/master/mysqld_exporter.go#L119 [10:20:26] so it does look kinda related [10:20:28] there should not be .my.cnf on /root [10:20:35] on any host [10:20:46] I think you are trying to run it manually, which you shouldn't [10:20:49] as root [10:20:57] yeah, but that cleaned up the errors [10:21:07] I am seeing those errors on those lines [10:21:08] but prometheus shouldn't be run as root [10:21:14] yeah, it was just for debugging [10:21:15] only started through systemd [10:22:04] still, for debugging, "sudo -u prometheus" [10:22:36] if it fails, then there may be extra grants needed by the user [10:23:15] on the new prometheus exporter for buster [10:32:15] Mar 10 10:31:54 db1111 prometheus-mysqld-exporter[17089]: time="2020-03-10T10:31:54Z" level=debug msg="collect query: []" source="mysqld_exporter.go:142" [10:35:46] Interesting, I just changed the ARGs for the systemd unit to run it as normal but with --log-level=debug and now it is getting green too (.my.cnf is removed) [10:38:01] And to make it even more interesting, just restarting the exporter (without touching ANYTHING) made db1078 start decreasing its errors [10:38:10] I am thinking about this combination [10:38:21] something's weird there [10:38:28] host restarts to buster, the exporter starts running, we start mysql and then we do the mysql_upgrade [10:38:43] So maybe it reads from information_schema when the tables are still not updated to 10.4 schemas [10:39:23] maybe create a subticket about this for further research? [10:39:32] yeah [10:39:36] let me confirm it [10:39:43] because db1078 is now getting healthy [10:39:56] let me try on another recently upgraded host [10:40:02] just restarting the exporter [10:40:06] but what did you do, just restart?
[10:40:15] just restart the exporter [10:40:21] umh [10:40:30] going to do that on https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [10:40:32] let's see [10:40:56] it could be because the exporter starts running before we update the internal tables to the 10.4 structure [10:45:16] I think it is confirmed: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&fullscreen&panelId=4&from=now-5m&to=now&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All all the hosts but db2125 have been restarted and they are decreasing errors [10:45:25] restarted as in mysqld exporter restarted just now [11:05:55] 10DBA, 10OTRS, 10Operations, 10Recommendation-API, 10Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (10Marostegui) Mail sent to wikitech-l: https://lists.wikimedia.org/pipermail/wikitech-l/2020-March/093175.html Deployments calendar window ad... [11:24:36] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) [11:25:30] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) p:05Triage→03Medium I still have to install some hosts with buster and 10.4 so I am going to confirm if that's the issue, by making sure to stop t... [11:31:30] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) The last host pending to restart from the above dashboard: ` root@db2125:~# curl -s localhost:9104/metrics | grep scrape | grep -v "#" mysql_exporter_... [11:34:50] akosiaris: production. LastWritten: 2019-12-12 04:40:20, VolRetention: 7,776,000, Expires in: 61,891 [11:34:56] ^that is expected [11:36:27] but the oldest on Databases is from 2020-02-05 [11:37:14] with expiresIn: 4,809,863 [11:37:31] so maybe I just missed an update (only updated production) [11:37:39] we'll see how it behaves now [11:38:12] I have no alerts based on retention- I may add that soon [11:39:25] it is not recycling indeed, a new volume has just been created [12:00:05] marostegui: count me in for the db1132 scheduled restart for the debmonitor part, although there should be nothing to do there [13:43:27] thanks volans [13:43:44] is there a calendar invite I can accept by any chance? [13:43:54] just to make sure I don't forget [13:45:26] Ah sure [13:45:29] I can send it to you [13:47:43] thx! <3 [13:56:35] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db2121.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202003101355_marostegui_227323.log`. [14:23:02] 10DBA: Install 1 buster+10.4 host per section - https://phabricator.wikimedia.org/T246604 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2121.codfw.wmnet'] ` and were **ALL** successful.
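As a side note on the bacula figures above: "Expires in" is simply LastWritten plus VolRetention minus the current time. A worked sketch with the numbers quoted for the production volume (the datetime and retention values are taken from the log; the script itself is only illustrative):

```python
from datetime import datetime, timedelta

# Figures quoted above for the "production" pool volume
last_written = datetime(2019, 12, 12, 4, 40, 20)
vol_retention = timedelta(seconds=7_776_000)        # exactly 90 days

expires_at = last_written + vol_retention           # -> 2020-03-11 04:40:20
now = datetime(2020, 3, 10, 11, 34, 50)             # roughly when the value was pasted

print(expires_at)
print(int((expires_at - now).total_seconds()))      # ~61,500 seconds
```

The result is close to the 61,891 seconds reported; the small gap presumably reflects check_bacula.py having been run a few minutes before the paste.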
[14:28:37] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) I have done the workflow in a different way, which is essentially, starting mysql and running mysql_upgrade BEFORE starting the exporter and: ` root@d... [14:38:16] so, aside from documentation, do you see any puppet patch fixing that^ [14:38:38] because we could add a dependency on systemd, but that would start mysqld [14:38:48] or we could add an if on puppet [14:38:53] Yeah, or just wait for mysqld to be running [14:38:55] but it may be worse than doing nothing [14:38:56] before starting [14:38:59] exactly [14:39:13] I will let you come up with a recommendation [14:39:15] I think it is not worth the effort, we can just issue a restart or something [14:39:16] yeah [14:39:29] I think I am going to add a page on wikitech with migration things [14:39:30] like that [14:39:39] or like that we have to remove stuff from grafana that no longer exists [14:39:57] or bugs we are waiting for upstream to fix before we can enable things back (like the optimizer flag) [14:43:46] jynus: I am thinking about setting es3 to RO tomorrow [14:43:57] as you wish [14:43:58] Not today [14:44:27] I am not sure [14:44:47] he he, I will help you whatever you decide :-D [14:46:23] marostegui: \o/ re: es5, btw [14:47:05] yaaay! [15:10:47] check_bacula.py dbprov2001.codfw.wmnet-Monthly-1st-Sun-Databases-mysql-srv-backups-snapshots-latest ~> 2020-03-10 10:14:44: type: F, status: T, bytes: 1,590,029,087,872 [15:15:49] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=dbprov2001.codfw.wmnet-Monthly-1st-Sun-Databases-mysql-srv-backups-snapshots-latest&from=1583836961285&to=1583853319711 [15:16:06] around 3TB in total size with both dbprovs [15:29:25] I've added the column `standalone` to the `sections` table, so those can be set on the database instead of being hardcoded on the script [15:30:01] select name FROM sections where standalone=1; ~> es1, es2, tendril [15:30:25] will send a patch to the prometheus config generator now [15:31:03] which will make https://gerrit.wikimedia.org/r/c/operations/puppet/+/576655 unnecessary [15:38:02] oooh that's useful yeah [15:40:11] this is not super necessary, but it will help us understand what information there is on tendril that we need [15:42:22] yeah, definitely [15:42:39] And as soon as we start using it for more source of truth kind of things, we will notice and change more things [15:56:44] I will put es1, es2, tendril, m4 and staging as "standalone" hosts (no replication) [15:57:05] es3 tomorrow yeah [15:57:09] yep [15:57:14] m4 is no longer, if you want to actually delete it [15:57:33] well, there is still db1108 around [15:57:36] yeah [15:57:42] but I think the dns is also gone [15:58:05] not a big deal just to make sure I have everything tendril has [15:58:11] yeah [15:58:20] can be deleted later [16:16:58] will test it later to make sure it is a noop: https://gerrit.wikimedia.org/r/c/operations/puppet/+/578547 [16:19:35] for now https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues [16:20:27] cool, thanks!
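To illustrate how the new `standalone` column could be consumed, here is a rough Python sketch of a generator reading zarcillo's `sections` table. Only the SELECT on `sections.standalone` mirrors the query quoted in the conversation; the connection parameters, the client library, and the idea of gating replication-lag targets on the flag are assumptions, not the real prometheus config generator in operations/puppet:

```python
#!/usr/bin/env python3
"""Rough sketch: use zarcillo's sections.standalone flag when generating
monitoring config. Connection details below are placeholders."""
import pymysql  # assumed client library; any MySQL driver would work the same way

conn = pymysql.connect(host="zarcillo.example.wmnet", user="ro", database="zarcillo")
with conn.cursor() as cur:
    # Same query as quoted above; returns e.g. es1, es2, tendril
    cur.execute("SELECT name FROM sections WHERE standalone = 1")
    standalone = {name for (name,) in cur.fetchall()}

def needs_replication_targets(section: str) -> bool:
    """Standalone sections have no master/replica topology, so a generator
    could skip emitting replication-lag scrape/alert targets for them."""
    return section not in standalone

for section in ("s1", "es2", "es5"):
    print(section, "replication checks:", needs_replication_targets(section))
```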
[16:20:41] link it from the task [16:21:12] yep :) [16:25:59] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) 05Open→03Resolved a:03Marostegui Added to the list of known issues: https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.... [16:26:01] 10DBA: Test MariaDB 10.4 in production - https://phabricator.wikimedia.org/T242702 (10Marostegui) [16:27:18] 10DBA: mysql_exporter_last_scrape_error flag on prometheus-mysqld-exporter increases after 10.4 upgrade - https://phabricator.wikimedia.org/T247290 (10Marostegui) [17:03:44] 10DBA, 10Operations, 10ops-codfw: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10jcrespo) [17:43:13] 10DBA, 10Operations, 10ops-codfw: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10Papaul) 05Open→03Resolved a:03Papaul Was cleaning up some old cables and accidentally disconnected the mgmt cable. All back up now [17:44:46] 10DBA, 10Operations, 10ops-codfw: backup2001.mgmt interface down - https://phabricator.wikimedia.org/T247324 (10jcrespo) We love it when the causes are as simple as this! (vs a complex to debug issue). Do not worry at all! Thanks! [17:49:40] Habemus snapshots on bacula! [19:26:26] 10DBA, 10Cleanup, 10Data-Services, 10cloud-services-team (Kanban): Drop DB tables for now-deleted fixcopyrightwiki from production - https://phabricator.wikimedia.org/T246055 (10Bstorm) @Marostegui `maintain-views` can clean up views individually, but it won't drop the DB in any case. May as well just dro...