[00:38:47] 10DBA, 10Gerrit, 10Operations, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532 (10Paladox) p:05Normal→03High Per T200739#5034407
[00:40:48] 10DBA, 10Operations, 10decommission, 10ops-eqiad: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10RobH)
[11:39:45] ok to install PHP sec updates on dbmonitor*?
[11:40:04] sure
[11:41:40] going ahead
[11:43:10] done, tendril confirmed working fine
[13:22:53] jynus: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497469/
[13:25:24] is that all replicas?
[13:25:53] yes
[13:27:43] I need some time to review
[13:27:56] of course
[13:27:58] also
[13:28:02] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497472/
[13:28:51] maybe you can take a look at icinga meanwhile :-)
[13:29:07] something broken? I have it open
[13:29:20] no, but I guess alerts?
[13:29:41] replicas + all of s2 + all of x1, etc.
[13:29:51] No, I was asking if there is something broken that I haven't realised, cause I always have it open :)
[13:29:59] nothing that I know of
[13:30:11] :)
[13:34:01] db1126 is a new host?
[13:34:22] yes, it is T211613
[13:34:22] T211613: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613
[13:35:13] yep, not in prod
[13:35:25] eventlogging will be affected
[13:35:48] they are aware
[13:35:50] elukey: ^
[13:36:22] yep, there is a note in the task to ping me beforehand
[13:38:13] question, why not use the spare host instead for the pc1007 replacement?
[13:38:31] it will have the same issues, but at least it will not contaminate the good one
[13:38:44] what?
[13:39:12] pc1010 is the spare one
[13:39:15] line 11 of https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/497469/1/wmf-config/db-eqiad.php
[13:39:29] oh, I see
[13:39:32] it is duplicated
[13:39:41] that made me read it wrongly
[13:40:39] my fault
[13:40:59] no one's fault!
[13:41:04] better be safe
[13:41:10] I have to check dbproxy1001 and 2, not sure if they are active or passive
[13:41:28] They are passive
[13:41:31] but double check please
[13:41:33] I think they were depooled last time
[13:41:39] but yeah, I want to double check
[13:41:41] yeah, I depooled them
[13:41:45] check them though, please
[13:41:59] ^I wrote the above note as a reminder
[13:42:15] rack 4/8 :-D
[13:43:26] lots of 1U servers lately
[13:44:05] btw (not related, but just remembered it), did you see my comment regarding the dump slaves quote?
[13:49:36] quote?
[13:49:44] oh, ticket
[13:49:47] yes
[13:49:56] didn't add anything because I had nothing to add
[13:50:10] I was wondering if we should assign it back to robh?
[13:50:31] sure
[13:52:53] and of course db1115 in the middle of the backups :-)
[13:54:12] haha
[13:59:22] marostegui: which ticket even is the one i should be looking at >.>
[13:59:35] what do you mean?
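
A minimal sketch of how one might verify that dbproxy1001/dbproxy1002 carry no traffic, assuming the dbproxies run HAProxy with an admin socket enabled; the socket path below is a guess and the log does not say how the check was actually done (it could equally be ss/netstat or the HAProxy stats page):

import csv
import io
import socket

# Assumed stats socket path; the real location depends on the HAProxy
# config on the dbproxy host (the "stats socket ..." line).
HAPROXY_SOCKET = "/run/haproxy/haproxy.sock"

def current_sessions(sock_path=HAPROXY_SOCKET):
    """Return {(proxy, server): current sessions} parsed from 'show stat' CSV."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(b"show stat\n")
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    text = b"".join(chunks).decode().lstrip("# ")
    sessions = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("scur"):  # 'scur' = current sessions for that proxy/server
            sessions[(row["pxname"], row["svname"])] = int(row["scur"])
    return sessions

if __name__ == "__main__":
    for (proxy, server), scur in current_sessions().items():
        print(f"{proxy}/{server}: {scur} current sessions")

If every backend reports 0 current sessions, the proxy is effectively passive, which is what the later "no connections on dbproxy1001 / nor on dbproxy1002" check confirms.
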
[13:59:45] for the cognate thing, i totally lost the ticket :(
[13:59:49] Ah
[13:59:51] T187960
[13:59:51] T187960: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960
[14:00:11] So basically the x1 master will be on read only for a few seconds (at the mysql level), as there is no way to disable x1 on mediawiki-config
[14:00:29] s2 will be on read only for a few minutes (that will be disabled on mediawiki-config)
[14:00:58] addshore: in an ideal world, all extensions detect they are in read only mode and fail gracefully
[14:01:10] we are not in that world :-)
[14:01:42] addshore: this is the specific comment: https://phabricator.wikimedia.org/T187960#4997790
[14:01:57] And your answer below
[14:02:06] okay, i just don't think my head can do this while in this meeting, will be back in 28 mins
[14:02:13] haha
[14:02:27] silly gerrit being down, bah
[14:07:08] addshore: not down anymore
[14:07:19] mutante: indeed
[14:09:01] jynus: regarding your question, I agree, maybe db1121?
[14:09:23] that works too
[14:09:33] let me amend!
[14:09:39] the originals with 3:1 at least
[14:10:00] do you want to merge this soon, so we don't get surprises?
[14:10:02] you mean db1121 with 3 in soon?
[14:10:05] yeah, I want to merge asap
[14:10:25] 84 with 3, the other with 1
[14:10:47] because the api has to warm up a lot
[14:10:52] (strange queries)
[14:11:36] check the new patch
[14:12:10] +1
[14:12:15] <3
[14:13:07] let's also prepare the master switchover
[14:13:21] (ourselves)
[14:14:28] the patches are ready
[14:14:32] you need to review them :)
[14:15:42] I mean the script and all that
[14:16:43] ah
[14:19:16] do you want to move the replication topology beforehand?
[14:19:45] no, just make sure you can run the script and all that
[14:20:00] do you mind taking a look at that?
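
The x1 read-only window described above (a few seconds at the MySQL level, since x1 cannot be disabled in mediawiki-config) comes down to toggling read_only on the x1 master. A minimal sketch, assuming the pymysql client library and placeholder host/credentials; the log does not show the exact commands that were run:

import time
import pymysql  # assumed client library; any MySQL client would do

X1_MASTER = "x1-master.example.wmnet"  # placeholder host name

def set_read_only(host, value, user="root", password=""):
    """Toggle read_only on a MariaDB server (1 = read only, 0 = writable)."""
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SET GLOBAL read_only = %s", (value,))
    finally:
        conn.close()

# Keep the window as short as possible: flip it on, do the maintenance
# step, then flip it back off even if something goes wrong.
set_read_only(X1_MASTER, 1)
try:
    time.sleep(5)  # placeholder for the actual maintenance work
finally:
    set_read_only(X1_MASTER, 0)
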
[14:20:07] basically, that we have everything handy
[14:20:12] ah
[14:20:26] well, I can, but if we had to run it, I'd prefer it to be you
[14:20:54] Then I need to hand you some other things
[14:21:00] sure
[14:21:01] I don't have enough hands :)
[14:21:22] although if someone calls at the door, you will be on your own, just saying :-)
[14:21:29] ok
[14:22:07] switchover.py db1066 db1076
[14:22:26] just want to make sure you have it located, just that
[14:22:42] actually with
[14:22:52] --skip-slave-move
[14:23:05] it is in the docs
[14:23:09] then we'd need to move the topology beforehand
[14:23:32] yeah, but only if needed
[14:23:52] I just want to prepare for the worst
[14:24:01] well, if we have to run the script it means it has failed and hence we need to move the slaves under the new one, that's why I removed the --skip-slave-move
[14:24:03] even if it is unlikely
[14:24:19] ok, the --slave-move doesn't work :-)
[14:24:27] that is why it is compulsory
[14:24:31] ah, ok
[14:24:33] it will work, eventually
[14:24:36] :-)
[14:24:39] so if it fails we have to move them manually and then run the script
[14:24:43] yep
[14:24:45] ok
[14:25:05] but still it is better than doing the 20 steps :-)
[14:25:27] totally
[14:25:37] and the movement will work soon (TM)
[14:25:47] the other thing is that we may have to do a failover
[14:25:57] which means before shutting down
[14:26:04] go read only
[14:26:26] get coords and wait a few seconds for all replicas to catch up
[14:26:27] and grab the master status, that was my plan
[14:26:30] yep
[14:26:47] this is what I wanted to do, aligning and pen-testing all potential steps
[14:26:52] just that :-)
[14:27:11] let me finish the reviews
[14:27:44] no connections on dbproxy1001
[14:28:15] nor on dbproxy1002
[14:28:26] good
[14:28:51] I am guessing x1 on codfw will lag :-)
[14:28:59] on switch
[14:29:52] wait, isn't a hiera change needed?
[14:30:09] not for x1 I think
[14:30:12] but double check it
[14:30:21] yeah, I want to check
[14:31:51] yeah, I think it needs mariadb::mysql_role: 'master'
[14:32:34] do you want me to amend?
[14:33:01] (note the script makes heartbeat work anyway)
[14:33:37] if you don't mind amending, that'd be great
[14:33:43] how far off are we? :)
[14:34:14] From the question in the other channel, it would be in 30 mins or so
[14:34:18] ack
[14:34:44] So, I think I will not set the whole wikis to readonly mode (the 4x shards), I'll just run the maintenance script afterwards to clean up the data
[14:35:13] within cognate, if DBReadOnlyErrors are thrown by the db abstraction (which they will be when x1 is readonly), cognate will just fail quietly (as jynus wants all the things to do)
[14:35:23] :)
[14:35:25] he he
[14:35:34] addshore: sure!
[14:35:47] we added that nice quiet failure after cognate broke things when x1 had issues before :)
[14:36:17] marostegui: https://gerrit.wikimedia.org/r/496723
[14:36:42] thanks
[14:36:49] let me double check the s2 one
[14:37:50] looking good
[14:38:01] thanks, many patches, easy to forget stuff
[14:38:02] thanks for checking
[14:43:35] depooling hosts
[14:53:44] marostegui: please also ping me when you set x1 to read only, so i can watch for water spurting out anywhere
[14:53:51] yep
[14:53:53] will do
[14:53:54] ty
[15:14:03] maybe time to install updates now?
[15:14:16] yes, I was about to do it
[15:14:24] you do it then?
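
The failover preparation described above (go read only, get the coordinates, wait a few seconds for all replicas to catch up, grab the master status) is the manual sequence to follow if the switchover tooling cannot be used. A minimal sketch of just that sequence, not of the production switchover.py script, assuming the pymysql client library and placeholder hosts/credentials:

import time
import pymysql  # assumed client library; hosts and credentials are placeholders

def fetch_row(conn, query):
    # Return a single result row as a dict (column name -> value).
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(query)
        return cur.fetchone()

def prepare_failover(master_host, replica_hosts, user="root", password=""):
    """Set the master read only, grab its coordinates, wait for the replicas."""
    master = pymysql.connect(host=master_host, user=user, password=password)
    with master.cursor() as cur:
        cur.execute("SET GLOBAL read_only = 1")

    coords = fetch_row(master, "SHOW MASTER STATUS")
    target = (coords["File"], coords["Position"])

    for host in replica_hosts:
        replica = pymysql.connect(host=host, user=user, password=password)
        while True:
            status = fetch_row(replica, "SHOW SLAVE STATUS")
            # Assumes the usual zero-padded binlog file names, so comparing
            # (file, position) tuples orders coordinates correctly.
            applied = (status["Relay_Master_Log_File"],
                       status["Exec_Master_Log_Pos"])
            if applied >= target:
                break
            time.sleep(0.5)
        replica.close()

    master.close()
    return target  # the coordinates everything would be repointed to

If the script then fails, these are the coordinates the replicas would be moved under the new master with by hand, before re-running it.
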
[15:14:28] yep
[15:14:31] ok
[15:40:41] everything looks good, although there was a spike in writes after the restart, probably from the job queue
[15:42:49] I created https://phabricator.wikimedia.org/T218692
[15:43:28] small thing, not sure if it was the read only or the lack of connection
[15:43:31] but thanks
[15:44:05] yeah, just to track it
[15:44:12] I know
[15:44:21] and to see if we can get something better than "unknown error"
[15:44:22] I think it is both
[16:20:23] I am going to go foff
[16:20:24] off
[16:20:31] bye!
[16:20:41] I will come back later to repool the hosts, or else we can do it tomorrow morning anyway
[16:20:59] I was going to do it myself later
[16:21:08] ah, that's great then! :)
[16:21:17] I didn't know if you were planning to be around till the network maintenance was done
[16:21:28] I will wait for it to finish first
[16:22:04] Thanks :)
[17:19:11] while deploying I saw report_users.sh, which should probably be puppetized
[19:19:20] 10Blocked-on-schema-change, 10Notifications, 10MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), 10Patch-For-Review, 10Schema-change: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763 (10Etonkovidova)
[19:23:18] 10Blocked-on-schema-change, 10Notifications, 10Patch-For-Review, 10Schema-change: Remove event_page_namespace and event_page_title - https://phabricator.wikimedia.org/T136427 (10Etonkovidova)