[04:56:07] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db1090.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[04:59:59] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui)
[05:16:46] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1090.eqiad.wmnet'] ` and were **ALL** successful.
[08:03:43] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui)
[08:09:58] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db1113.eqiad.wmnet'] ` The log can be found in `/var/log/wmf...
[08:27:27] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1113.eqiad.wmnet'] ` and were **ALL** successful.
[08:35:36] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['dbproxy1013.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006170835_marostegui_17922.log`.
[08:56:47] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1013.eqiad.wmnet'] ` and were **ALL** successful.
[09:11:56] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[09:24:15] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2122.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[09:52:17] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2122.codfw.wmnet'] ` and were **ALL** successful.
[09:53:03] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui)
[10:09:46] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10Marostegui) As of today: 85/251 instances running MariaDB 10.4
[10:24:11] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2091.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006171023_marostegui_4864...
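The 10:09 progress note above (85/251 instances running MariaDB 10.4) is the kind of tally that can be scripted rather than counted by hand. Below is a minimal sketch of one way to gather it, assuming a plain host list, a monitoring account, and the pymysql client; this is an illustration only, not the tooling actually used at WMF.

```python
# Hypothetical sketch: count how many database instances already run MariaDB 10.4.
# The host list and credentials below are placeholders, not real WMF values.
import pymysql

HOSTS = ["db1090.eqiad.wmnet", "db1113.eqiad.wmnet"]  # illustrative subset only


def count_upgraded(hosts, user, password):
    upgraded, total = 0, 0
    for host in hosts:
        total += 1
        try:
            conn = pymysql.connect(host=host, user=user, password=password)
        except pymysql.MySQLError:
            continue  # skip hosts that are down or mid-reimage
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT VERSION()")
                (version,) = cur.fetchone()
            if version.startswith("10.4"):
                upgraded += 1
        finally:
            conn.close()
    return upgraded, total


if __name__ == "__main__":
    done, total = count_upgraded(HOSTS, user="monitor", password="secret")
    print(f"{done}/{total} instances running MariaDB 10.4")
```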
[10:46:52] 10DBA, 10Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2091.codfw.wmnet'] ` and were **ALL** successful.
[10:49:40] 2020-06-16 05:49:32 [ERROR] - Error connecting to database: Lost connection to MySQL server at 'reading authorization packet', system error: 2 "No such file or directory"
[10:50:33] which host?
[10:50:45] m2, I am guessing db1117
[10:51:24] That instance has not been touched in a long time I believe
[10:51:47] I will retry, maybe network or other spurious error
[10:53:26] nothing on syslog around that time
[10:54:04] yeah, doesn't necessarily mean a db error
[10:54:09] could be network or client issue
[10:54:13] or load issue
[10:54:30] m2 has otrs, it is a bit of an outlier
[10:54:44] but it is the first time a dump fails in a long time
[10:58:51] 10DBA, 10Core Platform Team: text table still has old_* fields and indexes on some hosts - https://phabricator.wikimedia.org/T250066 (10Marostegui) s7 eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1003 [] db1136 [] db1127 [] db1125 [] db1116 [] db1101 [] db1098 [] db1094 []...
[11:06:52] 10DBA, 10Patch-For-Review: Make partman/custom/no-srv-format.cfg work - https://phabricator.wikimedia.org/T251768 (10Marostegui) >>! In T251768#6214989, @Kormat wrote: > This can be closed when we change the default recipe for db hosts to reuse-parts ({T252027}), after we've been using it for a week or so more...
[12:18:47] i just drafted 3 OKRs for myself as acting D/P manager in the OKR doc
[12:18:56] they probably need more work/tweaks, but happy to take your input
[12:19:13] and they indicate the direction I'm thinking (and we discussed yesterday), i hope
[12:20:58] mark: I think they look good yeah, not too sure about the sentence "error rates"
[12:21:52] right
[12:21:55] agreed
[12:22:17] changed it to "reduce time spent and mistakes"
[12:22:38] not that I think there are many mistakes made in this team with manual work
[12:22:47] but in general automation/standardization /should/ help with it
[12:22:49] Definitely
[12:23:03] Maybe something like: reduce the likelihood of manual mistakes
[12:23:05] and if it reduces time spent but increases mistakes, we've failed too ;)
[12:23:05] or something like that?
[12:23:09] haha
[12:23:47] likelihood is a bit too wishy-washy to my liking, for an objective
[12:24:56] yeah, I see
[12:26:34] The word mistakes still sounds strange there
[12:26:39] Not sure how to phrase it
[12:29:55] perhaps "reduce time spent and potential for mistakes"
[12:29:59] "Reduce manuel mistakes" isn't appropriate i guess ;)
[12:30:04] mark: that sounds good!
[12:30:06] although that's not too far from "likelihood of"
[12:30:10] but i still like it better
[12:30:14] agree
[12:30:26] kormat: :(
[12:30:42] marostegui can just blame the developers
[12:30:51] "they told me to deploy this schema change!"
[12:31:08] * marostegui is currently scared enough of the MCR schema change
[12:35:10] mark: as an alternative approach, what about "Reduce error-prone manual work"?
[12:35:40] what about other manual work?
[12:35:53] and other manuel work, to pile on to your theme ;)
[12:36:03] haha
[12:36:10] is there manual work that _isn't_ error-prone?
[12:36:17] manuel work isn't error-prone
[12:36:21] :)
[12:36:37] but I guess, i'm not sure why that's better
[12:36:46] and it does suggest we're focused primarily on error-proneness
[12:36:52] while I think time spent is at least as important
[12:36:58] just shouldn't make errors worse ;)
[12:37:16] and of course, manuel work isn't fun
[12:37:19] manual
[12:37:20] sorry!
[12:37:21] :)
[12:37:23] hahaha
[12:37:26] I need to change my name
[12:37:47] * kormat giggles
[12:46:22] marostegui: you're spanish, you've probably got lots of spare ones
[12:46:38] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2091.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006171246_marostegui_25785.log`.
[12:47:46] kormat: I actually do have a second one, but never really use it
[12:47:50] Maybe I should start now!
[13:08:34] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2091.codfw.wmnet'] ` and were **ALL** successful.
[13:18:25] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui)
[14:25:48] anyone working on db1091?
[14:26:13] NOT talking about db2091
[14:26:29] I am not working on db1091
[14:26:32] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, No puppet role has been assigned to this node. (file: /etc/puppet/manifests/site.pp, line: 2349, column: 9) on node db1091.eqiad.wmnet
[14:26:35] what's up?
[14:26:50] either deploy or network issue
[14:26:57] I worked on db2091
[14:26:59] So maybe a typo
[14:27:00] let me check
[14:27:17] yep
[14:27:18] my bad
[14:27:19] fixing
[14:27:57] oh, so it was related to that
[14:28:04] weird error, was it on site.pp?
[14:28:22] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606149/3/manifests/site.pp
[14:28:30] yeah, I removed db1091 instead of db2091
[14:28:32] regex fail
[14:29:07] "we meet again, archnemesis" :-D
[14:29:53] this fixes it https://gerrit.wikimedia.org/r/606199
[14:30:07] see my comment on ops
[14:30:14] yep
[14:31:14] marostegui: oops, missed that
[14:31:20] no worries!
[14:31:50] the other ongoing warning we have is: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2020-06-17 04:33:27 is 501 GB, but previous one was 534 GB, a change of 6.3%
[14:32:14] 6.3% size increase on s6 in 2 days, but only on codfw
[14:32:29] let me see why
[14:32:50] that is 30+ GB fast
[14:32:58] ran puppet and issue fixed on db1091
[14:33:08] cool
[14:33:18] jynus: possible because of a schema change running and the temporary table being caught in the middle of the backup?
[14:33:25] jynus: s6 codfw is getting the MCR change
[14:33:30] so maybe that's why
[14:34:11] I misread the direction
[14:34:18] it decreased by 30GB
[14:34:28] then it is the schema change for sure
[14:34:31] do you have a ticket/list of tables involved?
[14:34:33] as we are shrinking the table
[14:34:38] revision and archive
[14:34:38] so I can check on metadata?
[14:34:40] thanks
[14:34:48] I want to make sure we have not lost data
[14:35:08] It could also be text, but I am dropping very old columns there not being used, so I don't think they are those
[14:35:13] focus on archive and revision I would say
[14:35:55] backup ids relevant are 6370 and 6320, I am checking the metadata of those tables
[14:39:35] page hasn't changed a lot, but revision halved almost: phttps://phabricator.wikimedia.org/P11569
[14:39:48] https://phabricator.wikimedia.org/P11569
[14:42:04] I knew mcr was ongoing, but not that it would have such a dramatic decrease in size
[14:42:27] good news!
[14:42:40] this was only a backup warning - should I keep it or increase the threshold?
[14:43:57] I think it was helpful
[14:44:13] let's leave it like that I think
[14:44:24] it is a good threshold
[20:50:20] 10DBA, 10Gerrit: Make sure `reviewdb-test` database (used for gerrit upgrade testing) gets torn down - https://phabricator.wikimedia.org/T255715 (10QChris)
[20:50:47] 10DBA, 10Gerrit: Make sure `reviewdb-test` database (used for gerrit upgrade testing) gets torn down - https://phabricator.wikimedia.org/T255715 (10QChris)
[20:50:50] 10DBA, 10Gerrit: Get a writable reviewdb clone to test Gerrit upgrade with - https://phabricator.wikimedia.org/T254516 (10QChris)
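The snapshot warning discussed above (s6 at codfw going from 534 GB to 501 GB) boils down to a relative-change check against a threshold, which the team agreed to keep as-is. A minimal sketch of that kind of check follows; the function name, the 5% threshold, and the message format are assumptions for illustration, not the actual WMF backup-monitoring code.

```python
# Hypothetical sketch of a snapshot size-change check like the warning quoted
# above: compare the latest snapshot size against the previous one and warn
# when the relative change exceeds a threshold. Threshold and wording are
# illustrative only.

def size_change_warning(previous_gb, latest_gb, threshold=0.05):
    """Return a warning string if the size changed by more than `threshold`."""
    change = (latest_gb - previous_gb) / previous_gb
    if abs(change) > threshold:
        direction = "grew" if change > 0 else "shrank"
        return (f"Snapshot {direction} from {previous_gb:.0f} GB to "
                f"{latest_gb:.0f} GB ({abs(change):.1%} change)")
    return None


# The s6@codfw case from the log: 534 GB -> 501 GB, roughly the ~6% change
# reported in the warning above.
print(size_change_warning(534, 501))
```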