[04:56:07] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db1090.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[04:59:59] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui)
[05:16:46] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1090.eqiad.wmnet'] ` and were **ALL** successful.
[08:03:43] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui)
[08:09:58] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db1113.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:27:27] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1113.eqiad.wmnet'] ` and were **ALL** successful.
[08:35:36] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['dbproxy1013.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006170835_marostegui_17922.log`.
[08:56:47] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1013.eqiad.wmnet'] ` and were **ALL** successful.
[09:11:56] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[09:24:15] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2122.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[09:52:17] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2122.codfw.wmnet'] ` and were **ALL** successful.
[09:53:03] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui)
[10:09:46] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) As of today: 85/251 instances running MariaDB 10.4
[10:24:11] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2091.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006171023_marostegui_4864...
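The 10:09 progress note above (85/251 instances running MariaDB 10.4) is the kind of tally that can be scripted rather than counted by hand. Below is a minimal sketch of one way to gather it, assuming a plain host list, a monitoring account, and the pymysql client; this is an illustration only, not the tooling actually used at WMF.

```python
# Hypothetical sketch: count how many database instances already run MariaDB 10.4.
# The host list and credentials below are placeholders, not real WMF values.
import pymysql

HOSTS = ["db1090.eqiad.wmnet", "db1113.eqiad.wmnet"]  # illustrative subset only


def count_upgraded(hosts, user, password):
    upgraded, total = 0, 0
    for host in hosts:
        total += 1
        try:
            conn = pymysql.connect(host=host, user=user, password=password)
        except pymysql.MySQLError:
            continue  # skip hosts that are down or mid-reimage
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT VERSION()")
                (version,) = cur.fetchone()
            if version.startswith("10.4"):
                upgraded += 1
        finally:
            conn.close()
    return upgraded, total


if __name__ == "__main__":
    done, total = count_upgraded(HOSTS, user="monitor", password="secret")
    print(f"{done}/{total} instances running MariaDB 10.4")
```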
[10:46:52] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2091.codfw.wmnet'] ` and were **ALL** successful.
[10:49:40] 2020-06-16 05:49:32 [ERROR] - Error connecting to database: Lost connection to MySQL server at 'reading authorization packet', system error: 2 "No such file or directory"
[10:50:33] which host?
[10:50:45] m2, I am guessing db1117
[10:51:24] That instance has not been touched in a long time I believe
[10:51:47] I will retry, maybe network or other spurious error
[10:53:26] nothing on syslog around that time
[10:54:04] yeah, doesn't necessarily mean a db error
[10:54:09] could be network or client issue
[10:54:13] or load issue
[10:54:30] m2 has otrs, it is a bit of an outlier
[10:54:44] but it is the first time a dump fails in a long time
[10:58:51] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui) s7 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1136 [] db1127 [] db1125 [] db1116 [] db1101 [] db1098 [] db1094 []...
[11:06:52] 10DBA, 10Patch-For-Review: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (10Marostegui) >>! In T251768#6214989, @Kormat wrote: > This can be closed when we change the default recipe for db hosts to reuse-parts ({T252027}), after we've been using it for a week or so more...
[12:18:47] i just drafted 3 OKRs for myself as acting D/P manager in the OKR doc
[12:18:56] they probably need more work/tweaks, but happy to take your input
[12:19:13] and they indicate the direction I'm thinking (and we discussed yesterday), i hope
[12:20:58] mark: I think they look good yeah, not too sure about the sentence "error rates"
[12:21:52] right
[12:21:55] agreed
[12:22:17] changed it to "reduce time spent and mistakes"
[12:22:38] not that I think there are many mistakes made in this team with manual work
[12:22:47] but in general automation/standardization /should/ help with it
[12:22:49] Definitely
[12:23:03] Maybe something like: reduce the likelihood of manual mistakes
[12:23:05] and if it reduces time spent but increases mistakes, we've failed too ;)
[12:23:05] or something like that?
[12:23:09] haha
[12:23:47] likelihood is a bit too wishy-washy to my liking, for an objective
[12:24:56] yeah, I see
[12:26:34] The word mistakes still sounds strange there
[12:26:39] Not sure how to phrase it
[12:29:55] perhaps "reduce time spent and potential for mistakes"
[12:29:59] "Reduce manuel mistakes" isn't appropriate i guess ;)
[12:30:04] mark: that sounds good!
[12:30:06] although that's not too far from "likelihood of"
[12:30:10] but i still like it better
[12:30:14] agree
[12:30:26] kormat: :(
[12:30:42] marostegui can just blame the developers
[12:30:51] "they told me to deploy this schema change!"
[12:31:08] * marostegui is currently scared enough of the MCR schema change
[12:35:10] mark: as an alternative approach, what about "Reduce error-prone manual work"?
[12:35:40] what about other manual work?
[12:35:53] and other manuel work, to pile on to your theme ;)
[12:36:03] haha
[12:36:10] is there manual work that _isn't_ error-prone?
[12:36:17] manuel work isn't error-prone
[12:36:21] :)
[12:36:37] but I guess, i'm not sure why that's better
[12:36:46] and it does suggest we're focused primarily on error-proneness
[12:36:52] while I think time spent is at least as important
[12:36:58] just shouldn't make errors worse ;)
[12:37:16] and of course, manuel work isn't fun
[12:37:19] manual
[12:37:20] sorry!
[12:37:21] :)
[12:37:23] hahaha
[12:37:26] I need to change my name
[12:37:47] * kormat giggles
[12:46:22] marostegui: you're spanish, you've probably got lots of spare ones
[12:46:38] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2091.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006171246_marostegui_25785.log`.
[12:47:46] kormat: I actually do have a second one, but never really use it
[12:47:50] Maybe I should start now!
[13:08:34] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2091.codfw.wmnet'] ` and were **ALL** successful.
[13:18:25] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[14:25:48] anyone working on db1091?
[14:26:13] NOT talking about db2091
[14:26:29] I am not working on db1091
[14:26:32] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No puppet role has been assigned to this node. (file: /etc/puppet/manifests/site.pp, line: 2349, column: 9) on node db1091.eqiad.wmnet
[14:26:35] what's up?
[14:26:50] either deploy or network issue
[14:26:57] I worked on db2091
[14:26:59] So maybe a typo
[14:27:00] let me check
[14:27:17] yep
[14:27:18] my bad
[14:27:19] fixing
[14:27:57] oh, so it was related to that
[14:28:04] weird error, was it on site.pp?
[14:28:22] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606149/3/manifests/site.pp
[14:28:30] yeah, I removed db1091 instead of db2091
[14:28:32] regex fail
[14:29:07] "we meet again, archnemesis" :-D
[14:29:53] this fixes it https://gerrit.wikimedia.org/r/606199
[14:30:07] see my comment on ops
[14:30:14] yep
[14:31:14] marostegui: oops, missed that
[14:31:20] no worries!
[14:31:50] the other ongoing warning we have is: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-06-17 04:33:27 is 501 GB, but previous one was 534 GB, a change of 6.3%
[14:32:14] 6.3% size increase on s6 in 2 days, but only on codfw
[14:32:29] let me see why
[14:32:50] that is 30+ GB fast
[14:32:58] ran puppet and issue fixed on db1091
[14:33:08] cool
[14:33:18] jynus: possible because of a schema change running and the temporary table being caught in the middle of the backup?
[14:33:25] jynus: s6 codfw is getting the MCR change
[14:33:30] so maybe that's why
[14:34:11] I misread the direction
[14:34:18] it decreased by 30GB
[14:34:28] then it is the schema change for sure
[14:34:31] do you have a ticket/list of tables involved?
[14:34:33] as we are shrinking the table
[14:34:38] revision and archive
[14:34:38] so I can check on metadata?
[14:34:40] thanks
[14:34:48] I want to make sure we have not lost data
[14:35:08] It could also be text, but I am dropping very old columns there not being used, so I don't think they are those
[14:35:13] focus on archive and revision I would say
[14:35:55] backup ids relevant are 6370 and 6320, I am checking the metadata of those tables
[14:39:35] page hasn't changed a lot, but revision halved almost: phttps://phabricator.wikimedia.org/P11569
[14:39:48] https://phabricator.wikimedia.org/P11569
[14:42:04] I knew mcr was ongoing, but not that it would have such a dramatic decrease in size
[14:42:27] good news!
[14:42:40] this was only a backup warning - should I keep it or increase the threshold?
[14:43:57] I think it was helpful
[14:44:13] let's leave it like that I think
[14:44:24] it is a good threshold
[20:50:20] 10DBA, 10Gerrit: Make sure `reviewdb-test` database (used for gerrit upgrade testing) gets torn down - https://phabricator.wikimedia.org/T255715 (10QChris)
[20:50:47] 10DBA, 10Gerrit: Make sure `reviewdb-test` database (used for gerrit upgrade testing) gets torn down - https://phabricator.wikimedia.org/T255715 (10QChris)
[20:50:50] 10DBA, 10Gerrit: Get a writable reviewdb clone to test Gerrit upgrade with - https://phabricator.wikimedia.org/T254516 (10QChris)
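The snapshot warning discussed above (s6 at codfw going from 534 GB to 501 GB) boils down to a relative-change check against a threshold, which the team agreed to keep as-is. A minimal sketch of that kind of check follows; the function name, the 5% threshold, and the message format are assumptions for illustration, not the actual WMF backup-monitoring code.

```python
# Hypothetical sketch of a snapshot size-change check like the warning quoted
# above: compare the latest snapshot size against the previous one and warn
# when the relative change exceeds a threshold. Threshold and wording are
# illustrative only.

def size_change_warning(previous_gb, latest_gb, threshold=0.05):
    """Return a warning string if the size changed by more than `threshold`."""
    change = (latest_gb - previous_gb) / previous_gb
    if abs(change) > threshold:
        direction = "grew" if change > 0 else "shrank"
        return (f"Snapshot {direction} from {previous_gb:.0f} GB to "
                f"{latest_gb:.0f} GB ({abs(change):.1%} change)")
    return None


# The s6@codfw case from the log: 534 GB -> 501 GB, roughly the ~6% change
# reported in the warning above.
print(size_change_warning(534, 501))
```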