[04:45:24] 10DBA, 10Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (10Marostegui) No, a BBU failure should not trigger a host reboot. Unfortunately, this is something we've seen with HP hosts over the years. Dell has also shown (sometimes) similar behaviours, which ended up with RAID controllers r...
[05:13:18] 10DBA, 10Operations, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) It should be ok to give them `CREATE TABLE` and even some need `CREATE TEMPORARY TABLE`, I think those two should be fine if they are needed. I would even...
[05:16:55] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui) dbproxy1008 grants removed from m3 (and also checked all the other mX sections): ` root@cumin2001:/home/marostegui# ./section m3 | while read host port; do echo $host; mysql.py -h$hos...
[05:17:05] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui) dbproxy1008 grants removed from m3 (and also checked all the other mX sections): ` root@cumin2001:/home/marostegui# ./section m3 | while read host port;...
[05:28:25] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui)
[05:28:35] 10DBA, 10Patch-For-Review: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[05:34:33] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin2001 for hosts: `dbproxy1008.eqiad.wmnet` - dbproxy1008.eqiad...
[05:35:27] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui)
[05:55:25] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10Marostegui)
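The shell loop quoted in the T231280/T255406 comments above is cut off; a minimal sketch of such a grants check, assuming the WMF-internal ./section and mysql.py helpers accept these arguments, with a documentation-range IP standing in for dbproxy1008's real address:

  # Hedged reconstruction, not the literal original: list any leftover
  # grants for the decommissioned proxy on every m3 host. The IP below is
  # a placeholder (203.0.113.0/24 is a documentation range); substitute
  # dbproxy1008's actual address.
  DBPROXY_IP='203.0.113.10'
  ./section m3 | while read host port; do
      echo "$host"
      mysql.py -h "$host" -e \
          "SELECT user, host FROM mysql.user WHERE host = '$DBPROXY_IP';"
  done

An empty result on every host would confirm the grants are gone.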
[06:55:54] was db1117 upgraded to buster recently?
[07:08:26] seems it was installed two days ago, yes
[07:15:28] jynus: yep, like 2 days ago or so
[07:16:14] please have a look at https://wikitech.wikimedia.org/wiki/MariaDB#Stretch_+_10.1_-%3E_Buster_+_10.4_known_issues
[07:16:27] last bullet point
[07:16:58] m* logical backups failed, I am reviewing the grants now
[07:21:50] the grants aren't there
[07:22:04] yeah, because I have already dropped them one by one :-D
[07:22:17] No, I removed them
[07:22:20] ?
[07:22:25] earlier today
[07:22:32] they were there
[07:22:36] believe me
[07:22:41] ok
[07:23:33] https://phabricator.wikimedia.org/P11644
[07:23:46] they were not for dump@'10.64.0.95'
[07:23:54] but they were for dump@'10.64.16.31'
[07:24:04] ah, maybe I removed only for 0.95
[07:24:34] there are 2 users because backups can come from either dbprov host
[07:26:25] I am taking the time to also check for mismatches between configured and existing databases
[07:26:32] like the one I commented on m5
[07:30:36] things should now be unified on codfw and eqiad, pending cloud's answer about their dbs
[07:31:15] I am going to retry misc dumps on both dcs
[07:31:29] ok, thanks
[07:31:47] we really need that account handling automation
[07:31:59] it is a huge blocker for us :'-(
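A minimal sketch of the grants check behind this exchange, assuming plain mysql client access to db1117: the dump user needs grants for both dbprov source addresses (both IPs taken from the conversation above), since backups can run from either host:

  # SHOW GRANTS fails loudly if the user@host pair does not exist.
  for ip in 10.64.0.95 10.64.16.31; do
      echo "=== dump@$ip"
      mysql -h db1117.eqiad.wmnet -e "SHOW GRANTS FOR 'dump'@'$ip';" \
          || echo "no grants found for dump@$ip"
  done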
[07:33:38] 10DBA, 10Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006240733_maro...
[07:33:57] 10DBA, 10Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[07:35:07] although let's look at the bright side, we have monitoring that tells us what failed
[07:35:28] that's better than nothing
[07:44:20] es1025 and es2022 failed too, will have a look at those as well
[07:51:29] 10DBA, 10Patch-For-Review: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['dbproxy1021.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202006240751_maro...
[07:56:03] also x1 snapshots on backup2002 worked well. FYI in case there is an emergency recovery
[08:05:48] 10DBA: Add more information to --help option of transfer.py - https://phabricator.wikimedia.org/T253219 (10Privacybatm) Yeah, the transferpy is now available at https://doc.wikimedia.org/ :-)
[08:08:24] marostegui: db1088 didn't die overnight. shall I start repooling it?
[08:08:53] kormat: let's check it is all green on icinga, enable notifications, and then start repooling, yeah
[08:11:25] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbproxy1021.eqiad.wmnet'] ` and were **ALL** successful.
[08:11:32] 10DBA: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 (10Marostegui)
[08:22:50] andrewbogott: I am not sure what this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/595220/ but if it is being used, it needs to also live on dbproxy1021
[08:23:12] andrewbogott: is dbproxy1017 now active, as in being used?
[08:25:48] jynus: db1140 was rebuilt recently?
[08:25:57] in s6
[08:26:06] cannot remember
[08:26:14] let me check
[08:26:56] yes, that was the one with the hw problems
[08:27:11] and now has 2 test dbs, but they are not being used except for testing
[08:27:14] was it rebuilt from production or from an existing snapshot?
[08:27:40] I just saw that a column that is being dropped for MCR is still there, so I was wondering if I missed that host or it was rebuilt recently
[08:27:44] cannot say
[08:27:55] ok, I will apply the change there
[08:27:57] probably snapshot?
[08:28:10] those are trash instances
[08:28:15] so I can drop them
[08:28:17] that would explain it, yes
[08:28:31] do not apply changes, I can rebuild it
[08:28:42] ok
[08:28:48] it was just a test to try out snapshots on 10.4
[08:28:56] I can remove it and rebuild it
[08:28:59] ok, sounds good
[08:29:04] as long as the canonical backup source is good
[08:29:23] is it missing from zarcillo/tendril?
[08:29:31] no, it is there
[08:29:37] it is on s6 at least
[08:29:48] so yeah, probably it got rebuilt as you were applying the change
[08:30:00] let me find the uptime
[08:30:17] Tue Jun 2
[08:30:33] yeah, it matches I think
[08:30:34] but I think that was the host that had multiple issues
[08:31:09] https://phabricator.wikimedia.org/T250602#6188173
[08:31:21] db1140 has been repopulated from dbprov snapshots of s1 and s6 and upgraded to 10.4.
[08:31:24] yeah, that explains it then
[08:31:28] Wed, Jun 3, 12:23
[08:31:39] yep, those snapshots wouldn't have the MCR schema change
[08:31:43] mystery solved
[08:31:52] there is an issue here
[08:32:00] because if you upgrade the backup sources too soon
[08:32:04] we could have issues
[08:32:11] but if we do it last, we could have issues too
[08:32:28] not sure how to "fix" that procedure
[08:32:58] yeah, it is tricky
[08:33:06] maybe it should be almost first
[08:33:14] as after all, we still have old backups
[08:34:24] on the other hand, we don't want to "experiment" with them by upgrading them too fast
[08:34:28] I will put down db1140
[08:34:38] ok, I won't apply MCR there
[08:34:47] latest snapshots should have it, right?
[08:35:37] from last week, I believe so, yeah
[08:39:32] s6 is gone from db1140
[08:40:41] will repopulate it later
[08:40:49] thanks!
[08:47:17] I am blocked on like 7 ongoing transfers and backups, will take a short break
[08:47:36] enjoy
[08:50:44] doing a puppet run across all db machines now that https://gerrit.wikimedia.org/r/c/operations/puppet/+/606708 is merged
[08:52:19] \o/, sorry I'm late for a final review, glad it was merged!
[08:52:22] I'll need to do a follow-up CR when elukey's CR gets merged
[08:52:40] volans: np, thanks for your input :)
[08:52:52] did a quick look now, +1 :)
[08:53:05] cheers :)
[08:54:39] 10DBA, 10Epic, 10Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin2001.codfw.wmnet for hosts: ` ['db2120.codfw.wmnet'] ` The log can be found in `/var/log/wmf...
[09:12:27] 10DBA: Package transferpy framework - https://phabricator.wikimedia.org/T253736 (10Privacybatm)
[09:13:16] db1108
[09:13:23] ECHAN
[09:13:50] uh?
[09:14:38] wrong channel
[09:16:18] ohh. puppet is disabled on that host. ok :)
[09:18:11] 10DBA, 10Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2120.codfw.wmnet'] ` and were **ALL** successful.
[09:19:04] so s6 backups shrank due to MCR
[09:19:18] but s4 grew 4% in 1 week
[09:19:33] which table?
[09:19:41] don't know yet
[09:19:57] will look at it when I come back
[09:20:23] but you may guess what I am thinking...
[09:20:43] that's why we bought bigger disks :)
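To answer "which table?", one option is to rank tables by on-disk size on an s4 replica and compare week over week; a sketch, with the host name a placeholder for any s4 host:

  # Largest commonswiki tables by data + index size, in GiB.
  mysql -h db1081.eqiad.wmnet -e "
      SELECT table_name,
             ROUND((data_length + index_length) / POW(1024, 3), 2) AS size_gib
      FROM information_schema.tables
      WHERE table_schema = 'commonswiki'
      ORDER BY data_length + index_length DESC
      LIMIT 10;"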
e.g.: `sudo cumin "P{P:mariadb::mysql_role%role = master} and P{R:profile::mariadb::section = s1}"` [09:33:52] will match all masters in s1 [09:34:28] oh nice [09:34:36] that's really useful [09:35:04] feel free to add cumin aliases if you need them [09:36:51] volans: can aliases be parameterized? [09:37:08] it's erb so yes [09:37:09] e.g. could i have something like "A:mysql_role = master" [09:37:15] no i meant at run-time [09:37:19] ah, no [09:37:23] sorry :) [09:37:25] ok :) [09:37:48] but you can have a bunch of parameters and generate all the combinations [09:37:57] role/section/dc [09:39:18] we currently do it only for DCs and OSes but can be expanded [09:42:50] volans: on a related (but non-cumin) note, is it possible to query puppetdb to say "give me all values for this resource parameter"? [09:43:08] e.g. if i wanted to get a list of all section names for sanity checking [09:43:08] yes [09:43:21] give me a sec [09:43:36] which resource? [09:43:54] so for this example, the titles of https://gerrit.wikimedia.org/r/606708 [09:43:55] er [09:44:03] R:profile::mariadb::section [09:46:26] from a cumin host: [09:46:26] curl -Gs "https://puppetdb1002.eqiad.wmnet/pdb/query/v4/resources" --data-urlencode 'query=["and", ["=", "type", "Profile::Mariadb::Section"]]' | jq '.[].title' | sort | uniq -c [09:47:00] the sheer elegance astounds me. ;) [09:47:02] but if that's your new define, how could it have different values? [09:47:04] ahahaha [09:47:36] you might want to cross check it with some pre-existing class/resource maybe? [09:48:04] puppetdb query syntax is the main reason why we have cumin :) [09:48:22] volans: right now i'm just thinking of it in terms of "i want to double-check what shards are defined" [09:48:24] or better, a cumin puppetdb backend with its own grammar [09:48:34] s/shards/sections/ [09:48:52] k [10:00:23] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:00:25] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui) [10:00:28] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:27:39] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui) [10:30:34] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:36:40] https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/607456 [10:37:10] I've tested the 0.1 version in production, I would like to upload it and puppetize it [10:37:26] but will wait for your ok [10:37:44] go for it, happy to start using the new version [10:38:27] the 0.1 doesn't have any huge feature change it is mosly refactoring and bug fixes [10:38:41] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) haproxy stopped, let's give it a few days to make sure nothing breaks. 
[10:00:23] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui)
[10:00:25] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[10:00:28] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui)
[10:27:39] 10DBA: Remove grants for the old dbproxy hosts from the misc databases - https://phabricator.wikimedia.org/T231280 (10Marostegui)
[10:30:34] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui)
[10:36:40] https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/607456
[10:37:10] I've tested the 0.1 version in production, I would like to upload it and puppetize it
[10:37:26] but will wait for your ok
[10:37:44] go for it, happy to start using the new version
[10:38:27] the 0.1 doesn't have any huge feature changes, it is mostly refactoring and bug fixes
[10:38:41] 10DBA, 10Operations, 10decommission-hardware, 10ops-eqiad, 10Patch-For-Review: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) haproxy stopped, let's give it a few days to make sure nothing breaks.
[10:44:13] commonswiki.image has grown 5 GB after compression in a week
[10:44:20] I think that's it
[10:45:01] so that's not related to SDC then
[10:46:08] but I don't see a huge trend change on uploads: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&from=now-30d&to=now-1m
[10:50:09] looking at this, it seems like at some point in the last week the upload rate doubled: https://grafana.wikimedia.org/d/000000034/media?panelId=24&fullscreen&orgId=1&refresh=5m
[10:50:38] 10DBA, 10Operations, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) As per our IRC chat, I have changed the database to `cas_test`
[10:50:47] so probably some kind of import process
[10:53:11] 10DBA, 10Operations, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) Please ping me when the db is set up but before closing this ticket to make sure backups are correctly configured, as this seems to be an important database to no...
[10:54:04] 10DBA, 10Patch-For-Review: Package transferpy framework - https://phabricator.wikimedia.org/T253736 (10Privacybatm) 05Open→03Resolved Merged the packaging patch!
[10:54:07] 10DBA, 10Google-Summer-of-Code (2020), 10Patch-For-Review: GSoC 2020 Proposal: Improve the framework to transfer files over the LAN - https://phabricator.wikimedia.org/T248256 (10Privacybatm)
[10:57:16] 10DBA, 10Patch-For-Review: Package transferpy framework - https://phabricator.wikimedia.org/T253736 (10jcrespo) Congrats on a good job! Let's add to the TODO list to add a proper man (1) page to the package (we could generate it -maybe- from the help).
[10:59:39] 10DBA, 10Patch-For-Review: Package transferpy framework - https://phabricator.wikimedia.org/T253736 (10Privacybatm) Thank you! Yeah, I will do that.
[11:06:25] we are generating public xmldumps and private backups at the same time on the same server: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=es1025&var-port=9104&from=1592946249757&to=1592996726683
[11:12:08] 10DBA, 10Operations, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond users created with access from the requested hosts: ` +--------------------------------------------------------------------------------------------...
[11:15:19] 10DBA, 10Operations, 10CAS-SSO, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Once this is tested and ready to move to production m1, I will work on the .sql files to keep track of the new grants for the dbproxies IPs. @jbond rememb...
[12:09:30] do you want to double-check db1140:3316 to see that the MCR change is now applied?
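One way to do that double-check, assuming direct client access to db1140:3316; rev_text_id is a stand-in for whichever column the MCR change actually drops, and frwiki is one of the s6 wikis:

  # 0 rows means the column is gone and the MCR change is applied here.
  mysql -h db1140.eqiad.wmnet -P 3316 -e "
      SELECT COUNT(*) AS still_there
      FROM information_schema.columns
      WHERE table_schema = 'frwiki'
        AND table_name = 'revision'
        AND column_name = 'rev_text_id';"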
[12:52:05] 10DBA, 10Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10jcrespo)
[13:03:12] 10DBA, 10Patch-For-Review: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui) a:05jcrespo→03Marostegui Assigning back to me to review what's pending after getting db1145 done by Jaime
[13:11:34] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10jcrespo) So db1145 is done, loaded from db1102 snapshots (stretch), and functionality (backup sourcing) moved to it. Assigning it to you so you can check what has been done and decide whether to resolve or wait to check new backups occurr...
[13:12:48] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10jcrespo) I intend to "take" db1102, delete its data and setup x1 with buster on it to generate 10.4 backups.
[13:14:01] 10DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10jcrespo) See: T253217#6252401
[13:19:59] 10DBA: Upgrade x1 databases to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254871 (10Marostegui)
[13:20:55] 10DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (10Marostegui) >>! In T253217#6252401, @jcrespo wrote: > I intend to "take" db1102, delete its data and setup x1 with buster on it to generate 10.4 backups. Remember that you can also take db1084 anytime now (needs to be depooled first)
[13:32:45] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui) @jcrespo leaving notifications disabled in puppet for db1145 is intended? db1145 is green on icinga, so not sure what your plans are regarding notifications for that host.
[13:33:17] ^marostegui I think they were disabled but puppet didn't run yet
[13:33:24] unless I made a mistake
[13:33:53] jynus: ah ok! cool, that must be it yep
[13:33:59] it could be that I forgot
[13:34:05] let me double check the patch
[13:34:21] yeah, this is the intention: https://gerrit.wikimedia.org/r/c/operations/puppet/+/607493/1/hieradata/hosts/db1145.yaml
[13:34:34] cool
[13:34:40] then it is just puppet pending to run
[13:34:41] I didn't worry too much about watching it deploy because it is not a core host
[13:34:41] all good
[13:35:11] 10DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (10Marostegui) 05Open→03Resolved All done - puppet pending to run and enable notifications on db1145
[13:35:14] 10DBA, 10DC-Ops, 10Operations, 10ops-eqiad: (Need By: 31st May) rack/setup/install db114[1-9] - https://phabricator.wikimedia.org/T251614 (10Marostegui)
[13:35:15] I only do synchronous checks on important hosts
[13:35:33] I didn't know whether to close it or to wait for next week's backups
[13:35:37] I left it to you
[13:35:54] we can always reopen or create a specific task for that if backups fail
[13:36:00] yeah
[14:49:13] 10DBA, 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jbond) >>! In T256120#6252052, @Marostegui wrote: > Once this is tested and ready to move to production m1, I will work on the .sql files to kee...
[14:50:06] 10DBA, 10Operations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) Should be fixed now.
[15:28:00] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[15:33:38] 10DBA, 10Cloud-Services, 10CPT Initiatives (MCR Schema Migration), 10Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) I have altered s6 entirely (frw...
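To verify an alter like that landed on a whole section, the ./section helper pattern quoted earlier in this log could be reused; rev_text_id is again a stand-in for the real MCR column:

  # Flag any s6 host where the dropped column is still present.
  ./section s6 | while read host port; do
      count=$(mysql -h "$host" -P "$port" -BN -e "
          SELECT COUNT(*) FROM information_schema.columns
          WHERE table_schema = 'frwiki'
            AND table_name = 'revision'
            AND column_name = 'rev_text_id';")
      if [ "$count" = "0" ]; then
          echo "$host:$port done"
      else
          echo "$host:$port still pending"
      fi
  done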
[16:00:19] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Papaul) a:05Papaul→03Kormat Disk replacement complete
[17:22:58] 10DBA, 10Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (10Dzahn) @dpifke @Marostegui Hi, I did the following: - added new random password in private puppet in hieradata: `puppetmaster1001:/srv/private/hieradata/role/common/webperf/xhgui.yaml` `profile::webperf...
[17:27:20] 10DBA, 10Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (10dpifke) LGTM, thanks!
[18:36:20] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (10Papaul) Return information {F31903968}
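After a disk swap like the pc2007 one, a quick controller check confirms the array is rebuilding or back to Optimal; this assumes a MegaRAID controller and the megacli binary (named MegaCli64 on some installs), which may not match pc2007's actual hardware:

  # Logical drive state (Optimal / Degraded) and per-disk firmware state.
  sudo megacli -LDInfo -Lall -aAll | grep -E 'Virtual Drive|State'
  sudo megacli -PDList -aAll | grep -E 'Slot|Firmware state'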