[04:54:14] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (10Marostegui) 05Open→03Resolved All good! Thanks! `
root@db2070:~# hpssacli controller all show config
Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FADD0)
   Port Name: 1I
   Port Name:...
[04:59:43] 10DBA, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 2 others: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 (10Marostegui)
[05:08:26] 10DBA, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 2 others: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 (10Marostegui)
[05:08:43] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui)
[05:20:06] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui)
[05:26:34] 10DBA, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 2 others: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 (10Marostegui) s3 hosts:
[x] labsdb1012
[x] labsdb1011
[x] labsdb1010
[x] labsdb1009
[x] dbstore1004
[x] db1124
[x] db1123
[x] db1095
[...
[05:34:51] 10DBA, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 2 others: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 (10Marostegui)
[05:35:10] 10DBA, 10CirrusSearch, 10Discovery, 10MediaWiki-JobQueue, and 2 others: Change job table to use mediumblob for job_params field - https://phabricator.wikimedia.org/T219887 (10Marostegui) 05Open→03Resolved This is all done
[05:38:38] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui)
[05:45:11] 10DBA, 10Patch-For-Review: Decommission 2 codfw x1 hosts db2033 and db2034 - https://phabricator.wikimedia.org/T219493 (10Marostegui)
[05:56:08] 10Blocked-on-schema-change, 10Notifications, 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), 10Patch-For-Review, 10Schema-change: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763 (10Marostegui) I will probably depool all slaves on x1 on Monday early in the morning and go ahead an...
[06:53:41] would the 24th work for you for s3 failover? https://phabricator.wikimedia.org/T219115#5081886 if so, I will start the process with the liaisons to request read only etc
[07:11:55] 10DBA, 10TechCom-RFC: MediaWiki database policy and/or guidelines (2019) - https://phabricator.wikimedia.org/T220056 (10jcrespo) Jdforrester-WMF's edits are fair, and I think we should separate changes to mediawiki data architecture, which should be discussed with any or every mediawiki stakeholder, and change...
[07:18:42] jynus: sorry about the sql_mode=TRADITIONAL thing :/
[07:19:00] the CI containers use it, but MediaWiki nicely unsets all modes by default
[07:19:18] thanks, see https://phabricator.wikimedia.org/T119371#5084172
[07:19:34] yeah
[07:19:49] and Aaron did +2 it on monday but the change did not merge because hmmm reasons ;D
[07:19:53] (ci broke)
[07:47:12] and announced it at https://lists.wikimedia.org/pipermail/wikitech-l/2019-April/091877.html
[07:47:30] jynus: the mediawiki change has merged so CI should now run tests with TRADITIONAL
[07:47:50] sorry I thought that enabling it on the server side was sufficient and missed the fact that mw overrides it :^\
[07:50:15] thanks!
[07:52:35] hashar: is there a way to see if essential things like core unit tests fail right away or something?
[07:53:55] well they pass :]
[07:54:04] cool
[07:54:17] since CI ran the test with $wgSQLMode = 'TRADITIONAL'
[07:54:22] they all passed and that merged
[07:54:36] then who knows what else is broken. We would need to run the tests on every single repository and grab the results
[07:54:52] sure, that is ok, we may have complaints
[07:55:11] I do know there are extensions doing bad stuff
[08:00:47] jynus: sorry to ask again, did you see my comment above? I want to make sure we give enough heads up
[08:03:49] which one?
[08:04:02] ~/marostegui 8:53> would the 24th work for you for s3 failover? https://phabricator.wikimedia.org/T219115#5081886 if so, I will start the process with the liaisons to request read only etc
[08:04:12] I did not see that
[08:05:36] it works for me, but I would prefer something like the 10th
[08:06:19] 10th of may?
[08:06:25] april
[08:06:47] ok, I will let them know we have to rush for this then
[08:06:59] I don't want to come during easter
[08:07:06] because the master is down
[08:07:24] yep
[08:07:53] yes, it is not an immediate emergency, but I would say any hw issue on the master is an emergency
[08:08:06] sounds good
[08:08:07] 10DBA, 10Operations, 10ops-eqiad: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10Marostegui) We have agreed that we want to aim for **10th April** to avoid the risk of the master going down unexpectedly during the upcoming Easter holidays where there will be less c...
[08:08:11] I will create the RO ticket
[08:08:15] we can involve mark if communication complains
[08:08:25] let him handle it
[08:08:39] I am sure they will understand :)
[08:08:55] I can drop all backup stuff and work on this if you don't have the time
[08:09:10] no, it's ok!
[08:09:14] don't worry
[08:09:22] I will let you know if I need more hands :)
[08:09:40] I think the only thing that worries me is coordinating with network
[08:09:48] to see which is the best candidate
[08:09:55] so we don't need a failover again
[08:10:27] As far as I know there is no network maintenance scheduled but I will ask arzhel
[08:11:16] note all of these are suggestions
[08:11:30] I am not telling you to do that, just expressing my opinion
[08:11:40] I know :)
[08:11:42] (because you asked :-))
[08:12:48] also it doesn't have to be the 10th; the 9th or 11th would be ok, I think, too
[08:13:13] 11th would work better for me personally
[08:17:11] Going to aim for the 11th if you don't mind
[08:18:20] ok
[08:20:01] I will let you own that as you started working on that, but ask me anything you need help with (contacting network, communications, dcops, preparing patches, etc.)
[08:20:14] sure, thanks!
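The sql_mode exchange earlier turns on the fact that MediaWiki sets its own SQL mode per connection ($wgSQLMode), so a server-side default such as TRADITIONAL is not necessarily what its queries actually run under. A minimal way to see both values from a shell; the hostname here is illustrative, not taken from the log:

```shell
# Show the server default next to what a fresh session inherits.
# MediaWiki connections additionally override the session value via
# $wgSQLMode, so even the session value shown here is not necessarily
# what MW itself runs under.
mysql -h db1075 -e "SELECT @@GLOBAL.sql_mode AS global_mode, @@SESSION.sql_mode AS session_mode\G"
```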
[08:49:47] we're doing the meeting at the normal time today
[08:49:54] thanks
[09:50:17] preparing is what takes most of the time for the backup, I think stopping replication may be a needed option to speed it up
[09:51:35] worth trying yeah!
[09:54:10] maybe a 'stop_slave: True' option, that if the replica is running, stops it and then starts it again
[09:54:33] That'd be cool yeah
[09:54:45] But I guess you'll test it before to see if it is worth implementing?
[09:54:48] I wonder if it should stop the sql thread or both; stopping the sql thread is the only thing needed, but it may create issues with too much backlog for things like s8
[09:54:58] I would stop both
[09:55:08] It may also need some monitoring changes
[09:55:26] we can do a downtime: True
[09:55:26] haha
[09:55:40] I think I may file a specific task just for this to evaluate all consequences
[09:56:00] Or follow up on the big one?
[09:56:49] I mean here: T206203
[09:56:49] T206203: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203
[11:12:55] I may file a subtask, depending on how backups run today
[11:13:09] with https://gerrit.wikimedia.org/r/500980
[11:13:23] (I may have to disable backups after that, on friday)
[11:14:52] s8 backup, started at 20 UTC, is still being prepared
[11:16:41] :(
[11:24:02] probably because it has to commit 8 hours of wikidata with 10GB of memory
[11:24:06] on HDs
[11:24:23] stopping the slave will fix that :-)
[11:24:52] I will deploy the above patch if/when it finishes
[12:36:04] no network maintenance planned for db1075's row; in fact, that row is in good shape, and the one where db1078 (the current master) lives is the one that will require maintenance at some point
[12:36:09] so we are good
[12:38:15] o/
[12:38:45] I'm doing T220096 and I would appreciate advice on running mysqldump
[12:38:47] T220096: codfw1dev: decide which DBs to reallocate to cloudcontrol2001-dev - https://phabricator.wikimedia.org/T220096
[12:39:18] would `--all-databases` work just fine?
[12:41:45] arturo: I normally also add --routines --events --triggers just in case
[12:42:00] And --single-transaction
[12:42:25] ok, so something like `mysqldump --all-databases --routines --events --triggers --single-transaction`
[12:43:00] yeah, you have to redirect that to a file
[12:44:07] thanks
[12:49:28] arturo: we have a script to wrap that
[12:49:32] if interested
[12:49:45] or is it just a one time thing?
[12:50:00] one time thing apparently
[12:50:47] anyway, if it takes a long time, tell us, we have faster methods now
[12:51:17] it took just a couple of secs
[12:51:22] cool then
[13:03:08] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10Marostegui) p:05Triage→03Normal a:03Papaul Can we get it replaced? Thanks!
[13:14:38] jynus: what should we do with this task? https://phabricator.wikimedia.org/T205628
[13:20:52] don't rush me!!!
:-D
[13:21:02] it is coming (like winter)
[13:21:52] xddddddddd
[13:22:12] that is part of zarcillo work, which is in a freezer
[13:22:16] well, winter is actually coming THIS week in Spain :)
[13:22:39] it makes no sense to do that if we don't have a metadata inventory
[13:22:41] jynus: got it, I was asking in case it wasn't valid anymore or something :)
[13:22:47] so soft blocked on
[13:22:54] T104459
[13:22:55] T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459
[13:23:05] which I think is already there as a dependency
[13:23:22] I think I gave you a query some days ago to fill it in
[13:23:39] if you still have it, paste it there
[13:23:54] mmmm not sure I am following you
[13:24:10] remember the "list of large tables" we sent to wikidata
[13:24:24] and you said "I will keep that query for later"
[13:24:36] aaaah
[13:24:37] it will be in the logs
[13:24:40] don't worry
[13:24:48] And I do!
[13:25:02] 10DBA, 10Gerrit, 10Operations, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) 05Stalled→03Resolved Closing this as resolved since this will be resolved with T200739
[13:25:10] select substring_index(substring_index(substring_index(file_name, '.', 2), '.', -1), '-', 1) as `table`, sum(size) as size FROM backup_files where backup_id = 747 and file_name NOT IN ('metadata', 'wikidatawiki-schema-create.sql.gz') GROUP BY `table` ORDER BY size DESC LIMIT 20;
[13:25:14] it is mostly doing that in an event, or a cron, or the backup stuff
[13:25:15] I think that was it
[13:25:19] that will fill up that table
[13:25:30] I thought about creating an event to do that
[13:25:46] and another event to set as failed ongoing backups for >24 hours
[13:25:50] I knew deep in your heart you liked events
[13:25:51] I knew it!
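For reference, the one-off dump discussed around 12:39-12:43 above, written out with the redirect to a file that jynus mentions. The output path is an assumption, and --single-transaction only guarantees a consistent snapshot for InnoDB tables:

```shell
# Logical dump of every database plus stored routines, events and triggers,
# which --all-databases alone would not include. The output file name is
# illustrative.
mysqldump --all-databases --routines --events --triggers \
          --single-transaction > /root/all-databases.sql
```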
[13:25:59] then I realized I hate events
[13:26:06] so I will create a cron job
[13:26:20] too late, you like them
[14:02:29] "Fully automated and continuous code health and deployment infrastructure" sounds familiar, marostegui?
[14:02:54] :-)
[15:41:55] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T220102 (10RobH) Please note this system is out of warranty and any disk swaps will need to be accomplished with on site spares.
[17:39:37] 10DBA, 10Goal, 10Patch-For-Review: Implement database binary backups into the production infrastructure - https://phabricator.wikimedia.org/T206203 (10jcrespo) Despite the failure and kill, it is nice because a process kill gets properly handled and the backup is set as failed.
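A minimal sketch of the 'stop_slave: True' idea from the morning's backup discussion on T206203: stop both replication threads for the duration of the copy so the prepare phase has less backlog to replay, then resume. The hostname and the copy command are placeholders, not the real backup tooling:

```shell
# Stop both the IO and SQL replication threads, take the backup, then resume.
# db1095 and the xtrabackup invocation are illustrative placeholders; the
# actual tooling and target directory would come from the backup config.
mysql -h db1095 -e "STOP SLAVE;"
xtrabackup --backup --target-dir=/srv/backups/snapshot   # placeholder copy step
mysql -h db1095 -e "START SLAVE;"
```

As discussed in the log, stopping only the SQL thread would suffice for consistency, but stopping both avoids an unbounded relay-log backlog on busy sections like s8.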