[03:30:25] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018): Determine how to update old compressed ExternalStore entries for T181555 - https://phabricator.wikimedia.org/T183419#3885446 (10Anomie) >>! In T183419#3855448, @Anomie wrote: > I'll run something to try to do a more accurate count, but it'll probably tak...
[05:18:18] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2055 - https://phabricator.wikimedia.org/T184285#3885517 (10Marostegui) 05Open>03Resolved Thanks! ``` root@db2055:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337C9270) Port Name: 1I Port Name:...
[05:53:47] hi
[05:54:12] hello
[05:54:27] we are on -operations
[05:55:30] ok
[06:34:48] 10DBA, 10Data-Services: Provide a new s8 master for sanitarium - https://phabricator.wikimedia.org/T177274#3885590 (10Marostegui) This can now proceed as the master has been failed over
[06:35:38] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3885592 (10Marostegui)
[06:36:27] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3836595 (10Marostegui) >>! In T174569#3877489, @Marostegui wrote: > > Once s7 is done, we only have s5 and s8 pending, which is blo...
[06:38:06] write throughput of s8 is less than half that of s5, but I assume that is expected
[06:38:17] have the many bots that were failing to write stopped?
[06:38:21] plus the announcement
[06:38:49] I don't see any failures on writes, so I assume it is normal
[06:40:10] s5 has gone to less than 25%
[06:40:29] * marostegui seeing how many servers we can get from there }:-)
[06:40:30] latency went up to 9 seconds
[06:41:44] labsdb1004 stopped replication
[06:41:45] checking why
[06:42:12] Last_SQL_Error: Could not execute Delete_rows_v1 event on table s51230__linkwatcher.linkwatcher_linklog
[06:42:41] uptime seems ok
[06:42:47] ok, that can wait
[06:42:55] happened tonight?
[06:42:59] yeah, I was checking if it was somehow related
[06:43:07] but it is not
[06:43:34] 180108 17:42:55
[06:43:38] broke yesterday evening
[06:44:53] I am going to duplicate s5 on dbstore2001, if you are ok with that
[06:44:57] to set up s8 there
[06:44:58] sounds good!
[06:45:09] we have to add it to sanitarium too
[06:45:16] later, or do you want to do the labs stuff?
[06:45:27] or I can do that afterwards
[06:45:31] let me fix labsdb1004 as it has been broken for hours
[06:45:38] actually, you should take a break?
[06:45:48] so should you!
[06:45:49] :)
[06:45:53] I did nothing
[06:46:27] you did all the monitoring and checks, which is even more important
[06:46:32] I will use the etherpad coordinates as canonical for new s8 setups
[06:46:36] ok :)
[06:47:33] I will double-check there is only garbage/heartbeat between yours and the restart
[06:48:11] good
[06:48:13] thanks :)
[06:48:37] every day I am more convinced we have to fix the heartbeat setup
[06:49:02] although in this case it was not really easy, as it was not a regular failover
[06:52:00] mysqlbinlog --start-position=564466566 --stop-position=32015 db1070-bin.001739 db1070-bin.001740 | egrep -v '^(#|\/\*|BEGIN|SET|COMMIT|DELIMITER|ROLLBACK|use)' | less
[06:52:07] ^ can confirm there is only heartbeat
[06:52:27] \o/
[06:53:12] Yeah, I can confirm that :)
[06:56:35] I made that second measurement after reset slave
[06:56:47] that is why I got a second pair of coordinates
[06:57:19] mysqlbinlog --start-position=280263966 --stop-position=280422244 db1071-bin.006096 | egrep -v '^(#|\/\*|BEGIN|SET|COMMIT|DELIMITER|ROLLBACK|use)' | less
[06:57:22] that is a good safety net
[06:57:23] let's see
[06:57:27] same for db1071
[06:58:09] only heartbeat
[06:58:17] yep
[06:59:03] so as long as we point new s8 hosts at db1071-bin.006096:280263966 we are ok
[06:59:17] we may need a manual heartbeat insert though
[06:59:35] yeah, probably
[06:59:39] at least on dbstore/sanitarium
[07:00:23] I am filling in the etherpad with those coords
[07:00:33] great
[07:16:06] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018): Determine how to update old compressed ExternalStore entries for T181555 - https://phabricator.wikimedia.org/T183419#3885603 (10jcrespo) That seems not too bad- I assume it takes so much time because you run it serially, which is preferable to avoid over...
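[editor's note] The two binlog checks above follow the same pattern: decode the event range between the old and new coordinates and strip the binlog framing, so anything left besides heartbeat writes is a real missed transaction. A minimal sketch of that check (the coordinates and file names are the ones quoted in the log; the `only_heartbeat` helper name is ours):

```shell
#!/bin/sh
# only_heartbeat: filter a decoded binlog on stdin down to the statements
# that are NOT heartbeat writes. Empty output means the coordinate range
# contained only heartbeat traffic and is safe to skip over.
only_heartbeat() {
    grep -Ev '^(#|/\*|BEGIN|SET|COMMIT|DELIMITER|ROLLBACK|use)' |
        grep -v 'heartbeat' || true
}

# Usage against the real binlogs, as in the log:
#   mysqlbinlog --start-position=564466566 --stop-position=32015 \
#       db1070-bin.001739 db1070-bin.001740 | only_heartbeat
```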
[07:46:07] 10DBA, 10Data-Services: Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758#3885626 (10Marostegui) Thanks a lot @bd808 for double checking. I have removed those files and this is how it looks now: ``` tar -tvf labsdb_backup.tar drwxr-xr-x root/roo...
[08:42:19] hello people
[08:42:21] morning :)
[08:42:46] today Chris and I would like to stop/shut down dbstore1002 to fix the DRAC issue
[08:42:53] (15:00 UTC)
[08:44:42] do we need to postpone the maintenance because of ongoing work (like s8 replication, etc.) or is it fine to proceed?
[08:55:12] elukey: I don't think you have to postpone it :)
[08:55:53] keep us updated though when you start/stop it so we are aware
[08:56:16] super, thanks :)
[08:56:43] fixing dbstore1002 should be easy, I am just doing dbstore2001 first
[09:00:40] I will fix dbstore1002 so we can forget about it
[09:00:52] it should be as easy as creating the new s8 channel with db1071's coords
[09:02:03] that is right, probably also an insert on heartbeat
[09:02:08] yep
[09:03:05] apparently dbstore2001's disk is not performant enough to make an s5 copy as fast as SSD-based hosts
[09:05:00] https://gerrit.wikimedia.org/r/403119
[09:09:03] BTW, not sure why we do filters on dbstores
[09:09:10] unrelated to the commit
[09:11:47] yeah, I guess there was other stuff there ages ago or something?
[09:14:25] dbstore1001: we can actually add the s8 channel tomorrow so we don't have to do any weird stuff; once we have passed the failover time, it should be fine to add the s8 channel just as a normal channel, as no wikidata writes will be happening on the already existing s5 channel
[09:21:16] elukey: dbstore1002 is now replicating s8
[09:21:41] \o/
[09:33:12] I have added coords for db1095 gathered from db1087 to the etherpad, but I prefer another pair of eyes to review it (no rush)
[09:34:02] on the server?
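[editor's note] "Creating the new s8 channel" on a multi-source MariaDB replica like dbstore1002 looks roughly like this. A sketch only: the connection name and binlog coordinates are the ones recorded in the log/etherpad, while the master host name and replication user are assumed placeholders.

```sql
-- Sketch: add the s8 replication channel on a multi-source replica.
-- MASTER_HOST and MASTER_USER are placeholders, not from the log.
CHANGE MASTER 's8' TO
    MASTER_HOST='db1071.eqiad.wmnet',
    MASTER_USER='repl',
    MASTER_LOG_FILE='db1071-bin.006096',
    MASTER_LOG_POS=280263966;
START SLAVE 's8';
SHOW SLAVE 's8' STATUS\G
```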
[09:34:15] probably also checking the sanitization
[09:34:30] yeah, another pair of eyes to double-check the position
[09:34:32] did you add the extra filter?
[09:34:43] I didn't touch db1095
[09:34:44] I think that requires puppet changes?
[09:34:54] so where?
[09:35:13] I didn't see a gerrit review
[09:35:21] ~/marostegui 10:33> I have added coords for db1095 gathered from db1087 to the etherpad
[09:35:27] ah
[09:35:28] ok
[09:35:29] hehe
[09:35:34] sorry
[09:35:45] no, no worries
[09:36:07] I am actually thinking of going downstairs for a proper tea and having some rest as I am feeling tired :)
[09:56:55] I am going to check that now, I was distracted by dbstore2001
[10:01:30] no worries
[10:01:34] I think I am going to take a break
[10:01:46] btw I checked and I don't think we have to change puppet for db1095
[10:02:01] just add the wild_do_table manually when bringing up the new s8 channel
[10:02:57] cool then
[10:03:12] let me check for real, I get distracted
[10:04:57] this is correct, db1087-bin.001749:595931565 should work
[10:05:16] I took that from the time you suggested
[10:05:44] yours is an event later, which is also ok
[10:06:43] thanks for checking
[10:06:53] feel free to add the s8 channel there or I will do it a bit later myself :)
[10:07:04] I can do that
[10:07:10] going for a tea now
[10:07:18] note there is nothing to drop there, that is only for multi-instance
[10:07:24] yep
[10:09:42] I now realize I will have to do the same for dbstore2001, as it is not a direct replica either
[11:03:54] 10DBA, 10Data-Services: Provide a new s8 master for sanitarium - https://phabricator.wikimedia.org/T177274#3885818 (10jcrespo) 05Open>03Resolved a:03jcrespo db1087 connected and replicating to sanitarium.
Other cloud stuff will be handled on T184179
[12:34:08] I will finish dbstore2001 and call it a day, and not drop any table today, clearly I am not in the right mind
[12:44:35] <_joe_> just saw on terbium
[12:44:43] <_joe_> Notice: /Stage[main]/Proxysql/File[/var/run/proxysql]/mode: mode changed '0755' to '0750'
[12:44:46] <_joe_> Notice: /Stage[main]/Profile::Proxysql/File[/run/proxysql]/mode: mode changed '0750' to '0755'
[12:45:03] <_joe_> that seems to happen on every puppet run
[12:45:12] <_joe_> fyi :)
[12:50:31] doesn't look good
[12:51:19] I will look at it tomorrow, you can disable proxysql if it is problematic, it is not currently in use and the port is not exposed outside
[12:52:03] wait, Proxysql/File[/var/run/proxysql] and Profile::Proxysql/File[/run/proxysql]?
[12:52:16] shouldn't that create a puppet error, rather than trying to execute both?
[12:52:42] ah, probably /var/run is a symbolic link to /run
[12:52:53] so technically it is correct
[12:53:04] but a huge fail for me
[12:53:38] it should be trivial to fix, I think, I hadn't thought about the symlink possibility
[12:53:54] sorry about that
[12:57:55] actually that shouldn't be in /run, it is partially the package's fault, it should be in /var/lib, not in /var/run
[12:59:01] oh, no, it is my typo
[12:59:14] I mixed up /var/run and /var/lib
[13:05:21] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Perform testing for TLS effect on connection rate - https://phabricator.wikimedia.org/T171071#3886222 (10jcrespo) Hi @aaron, I would still like to reproduce your results. Meanwhile, I thought of a reason why that can be- wit...
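[editor's note] The /var/run vs /run confusion above can be reproduced without Puppet: on systems where one path is a symlink to the other, two "different" File resources actually manage the same directory and flip its mode back and forth on every run. A scratch-directory sketch of the same situation:

```shell
#!/bin/sh
# Reproduce the duelling-resources situation in a scratch directory:
# var_run is a symlink to run, so both names resolve to one inode,
# just like /var/run -> /run on the affected host.
tmp=$(mktemp -d)
mkdir "$tmp/run"
ln -s run "$tmp/var_run"              # stand-in for /var/run -> /run

mkdir "$tmp/var_run/proxysql"         # created via the symlinked name
chmod 0750 "$tmp/var_run/proxysql"    # one resource wants 0750...
chmod 0755 "$tmp/run/proxysql"        # ...the other flips it to 0755

# Both names report the same mode: it is a single directory.
stat -c %a "$tmp/run/proxysql"        # prints 755
stat -c %a "$tmp/var_run/proxysql"    # prints 755
rm -rf "$tmp"
```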
[13:09:33] _joe_: fixed, thank you:
[13:09:40] Info: Applying configuration version '1515503322'
[13:09:44] Notice: Applied catalog in 27.00 seconds
[13:09:54] <_joe_> jynus: thank you for fixing it <3
[13:09:58] <_joe_> now go rest :)
[13:10:19] it was actually a bad bug
[13:10:55] it missed making permissions more restrictive, an actual problem for multi-user proxysql installations
[13:11:10] (of which we have none yet)
[13:18:05] I was thinking about deploying the schema change on s5, on the codfw master with replication
[13:18:10] but I think I will do it tomorrow instead
[13:19:20] +1
[13:24:42] I have added dbstore2001:3318 to tendril
[15:11:56] 10DBA, 10MediaWiki-Platform-Team, 10Structured-Data-Commons, 10Wikidata, and 2 others: MCR schema migration stage 0: create tables - https://phabricator.wikimedia.org/T183486#3886723 (10Anomie) 05Open>03Resolved a:03Anomie
[15:22:05] hello people
[15:22:24] chris is in the DC now and we'd be ready to start the procedure to shut down dbstore1002
[15:25:12] marostegui: --^
[15:28:15] elukey: do you guys need me for that? I was already on the sofa, we started at 6am today :)
[15:28:29] The mysql thing should be: stop all slaves;
[15:28:31] ahahah no no, you guys told me to alert you beforehand
[15:28:34] And then mysql stop
[15:28:35] to have the green light
[15:28:44] Sure, dbstore1002 is already replicating s8
[15:28:47] so we are done with it
[15:28:49] sorry, didn't mean to bother :(
[15:28:53] thanks!
[15:28:57] No, you didn't bother at all :)
[15:29:03] I hope it comes back fine
[15:29:08] * marostegui scared of rebooting old servers
[15:29:14] shhhhh never say that
[15:29:17] :-P
[15:29:35] oh hi cumin
[15:29:46] ahahahahah
[15:30:27] ahhahaha
[15:30:32] what is cumin?
[15:30:39] a real thing?
[15:31:40] elukey: volans has a notification for when cumin is written, I am 90% sure now, so you can trust me
[15:32:15] cumin cumin?
[15:32:20] I will confirm that in SF in a couple of weeks
[15:32:32] all right cumin
[15:32:35] c u m i n
[15:32:41] rotfl, I see you have already made a plan
[15:43:26] 10DBA, 10Commons, 10MediaWiki-Watchlist, 10Wikidata, and 2 others: re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#3886807 (10Lydia_Pintscher) No, it is still waiting on Marius. Marius: I'll put it in the next sprint and then Amir can handle it if you do...
[15:44:07] 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Performance, and 2 others: re-enable Wikidata Recent Changes integration on Russian Wikipedia - https://phabricator.wikimedia.org/T179012#3886810 (10Lydia_Pintscher)
[16:09:41] maintenance done, host back up. Started mysql and the slave replication, everything looks good
[16:09:55] the mgmt interface works now
[17:33:58] 10DBA, 10Data-Services, 10cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#3887176 (10madhuvishy)
[17:34:13] 10DBA, 10Data-Services, 10cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#3887189 (10madhuvishy) p:05Triage>03Normal
[17:51:30] 10DBA, 10Data-Services: Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758#3887307 (10bd808) >>! In T183758#3885626, @Marostegui wrote: > I will wait for your ok before copying to labstore1003. Looks good to me. Thanks for all the work on this @Maros...
[18:08:08] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018): Determine how to update old compressed ExternalStore entries for T181555 - https://phabricator.wikimedia.org/T183419#3887403 (10Anomie) I have no planned time yet. First https://gerrit.wikimedia.org/r/#/c/397632/ needs to be reviewed and merged. I've al...
[18:18:09] 10DBA, 10Data-Services: Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758#3887442 (10Marostegui) Thanks for double checking. I have left it at: `labstore1003:/root/labsdb_backup.tar` And its md5sum looks good: `d4bde7692b1c18e55079e57fa4aa8316` So I...
[18:40:06] 10DBA, 10Data-Services, 10Goal, 10Patch-For-Review, 10cloud-services-team (FY2017-18): Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3887506 (10bd808)
[18:40:09] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Create backups of user tables from decommissioned database servers - https://phabricator.wikimedia.org/T183758#3887502 (10bd808) 05Open>03Resolved a:03Marostegui The dumps are now published on the "scratch" volume accessible from Toolforge: ``` $...
[19:54:20] 10DBA, 10Data-Services, 10cloud-services-team: Maintain-views and maintain_meta-p scripts shouldn't run if mysql-upgrade is running - https://phabricator.wikimedia.org/T184540#3887176 (10Krenair) Is mysql-upgrade going to ensure it doesn't run while anything else is doing DDL?
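[editor's note] The md5sum comparison mentioned in T183758 above can be scripted. A small sketch (the `verify_md5` helper name is ours; the path and checksum in the usage comment are the ones from the log):

```shell
#!/bin/sh
# verify_md5 FILE EXPECTED_MD5: exits 0 iff FILE's md5sum matches the
# expected value, so a copy can be checked non-interactively.
verify_md5() {
    [ "$(md5sum "$1" | awk '{print $1}')" = "$2" ]
}

# Usage for the backup above:
#   verify_md5 /root/labsdb_backup.tar d4bde7692b1c18e55079e57fa4aa8316 \
#       && echo "checksum OK"
```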