[07:01:26] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4034267 (10Marostegui) 05Open>03Resolved a:03Cmjohnson Thanks Chris, it looks good now: ``` root@db1064:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual...
[07:23:25] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4034279 (10Marostegui)
[07:23:36] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4034280 (10Marostegui)
[07:23:59] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4034281 (10Marostegui)
[07:39:03] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4034294 (10Marostegui) s4 eqiad progress [] db1102 [] labsdb1009 [] labsdb1010 [] labsdb1011 [x] dbstore1...
[07:39:15] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4034295 (10Marostegui) s4 eqiad progress [] db1102 [] labsdb1009 [] labsdb1010 [] labsdb1011 [x] dbstore1001 (broken host...
[07:39:18] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4034296 (10Marostegui) s4 eqiad progress [] db1102 [] labsdb1009 [] labsdb1010 [] labsdb1...
[07:39:46] 10DBA, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), 10Patch-For-Review, 10Schema-change: Fix WMF schemas to not break when comment store goes WRITE_NEW - https://phabricator.wikimedia.org/T187089#4034297 (10Marostegui)
[07:40:05] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Multi-Content-Revisions, and 3 others: Schema change to prepare for dropping archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T185128#4034300 (10Marostegui)
[07:40:24] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Perform schema change to add externallinks.el_index_60 to all wikis - https://phabricator.wikimedia.org/T153182#4034301 (10Marostegui)
[09:06:32] We will attempt to failover m5 master today at 15:30 UTC: https://phabricator.wikimedia.org/T189005#4032485
[09:20:42] ok
[09:20:55] there is something wrong going on with db1114 https://logstash.wikimedia.org/goto/1b89fe4fa74634f8257f86795030ec53
[09:21:05] I will depool it and try to see what
[09:21:34] :(
[09:21:44] why sad?
[09:21:58] Because it is a new host
[09:22:20] so we love new hosts better than non-new ? :-D
[09:22:35] I don't think it is the hardware
[09:22:44] it must be config or heartbeat or something
[09:24:21] I am going to do it in 2 phases, to see if it is only main or api
[09:24:38] or if it only happens under high load or something
[09:24:46] I am reviewing all the config files
[09:25:19] yes, double check that in case there is a typo, but I think the host itself works ok
[09:25:35] just for some reason it spits errors
[09:25:40] related to replication
[09:26:25] it could be some kind of incompatible change in 10.1 (strange) or something related to gtid
[09:27:01] We have db1063 with 10.1 but it is a vslow, so it has less traffic
[09:28:49] as expected, the graphs don't show any lag
[09:30:37] no, but I think the load balancer cannot detect replication
[09:31:26] or
[09:31:31] it cannot even connect
[09:31:39] Wikimedia\Rdbms\LoadBalancer::getRandomNonLagged: server db1114 is not replicating?
[09:31:44] Wikimedia\Rdbms\LoadBalancer::doWait: failed to connect to db1114
[09:31:53] so it could also be permissions/password
[09:32:15] ha
[09:32:18] found it
[09:32:22] did the jobrunner access from a new ip?
[09:32:27] ERROR 1275 (HY000): Server is running in --secure-auth mode, but 'root'@'10.64.32.13' has a password in the old format; please change the password to the new format
[09:32:28] the new
[09:32:39] nope, that is root
[09:32:49] but it could be related
[09:32:53] but it might also be the wikiadmin user
[09:33:08] I checked and it has the right user for 10.64
[09:33:16] although I would expect to see that in the logs
[09:33:18] it has the wrong one for larger ranges
[09:33:34] I don't know, let me depool first
[09:33:48] and talk to joe about the job queue
[09:33:52] that would be my guess
[09:35:00] but the requests seem to be api ones, not (only) job queue
[09:35:08] so that would not fit perfectly
[09:36:48] the wikiadmin user can connect fine (from terbium), yes
[09:37:01] same for wikiuser
[09:37:31] it is something subtle, like some range being banned
[09:37:38] maybe the iptables could be, too
[09:38:02] I tried a telnet from mw1209 and it worked fine
[09:38:34] ok, so you are leaving only api traffic
[09:38:47] just to test, I will do the opposite to see how it reacts
[09:38:52] yeah yeah
[09:38:57] good test
[09:39:31] db2090 is doing the schema change, right?
[09:39:49] yeah, all s4 codfw is lagging
[09:39:55] I will ack it
[09:39:56] because of the schema change
[09:40:13] mmmm
[09:40:26] it should've been silenced with my script
[09:40:32] is it not on s4.hosts?
[09:40:34] let's see
[09:40:41] maybe it is new
[09:41:12] yeah, it is not on s4.hosts
[09:41:14] I will add it
[09:42:01] https://gerrit.wikimedia.org/r/#/c/413439/ yep, it is new
[09:43:03] I would have sworn that was pending
[09:43:35] so it is not pending anymore :)
[09:46:16] another thing it could be is differences in connection behaviour under high load
[09:46:28] it is probably the only server with 10.1 and high load
[09:46:34] yeah, that is correct
[09:46:40] and it has >100 ongoing connections
[09:46:49] as I said, db1063 is the other one with 10.1, but it is vslow in s6
[09:47:00] well, all the rcs
[09:47:19] 100 is quite a lot, maybe connections fail because they take longer on 10.1
[09:47:34] or need to be tuned (e.g. pool of connections)
[09:47:41] to have the same response as 10.0
[09:47:45] under load
[09:48:08] but we have left rc slaves on their own sometimes (i.e. alter table on the other rc slave)
[09:49:23] but still not so much load
[09:49:36] even with fewer resources
[09:50:08] that is my main thesis now because we have been stung by it when we migrated 5.5 -> 10.0
[09:50:20] oh really?
[09:51:42] can you help me check if my change either broke or fixed something?
[09:51:54] the m5 one?
[09:51:55] it is not clear to me now, but it gave me an error on deploy
[09:52:01] no, I will not deploy that one
[09:52:09] the partial depool
[09:52:11] ah
[09:52:15] let's seeee
[09:52:45] you can resume on neodymium as needed
[09:52:56] thanks
[09:54:48] jynus: I cannot see anything on fatals to be honest
[09:55:03] I think, if anything, the errors have reduced?
[09:55:07] moritzm: thanks!
[09:55:34] there was a small spike on db1114
[09:56:05] but probably because of the new connections to the other hosts
[10:04:52] so errors to db1114 have reduced, but of course it has less traffic
[10:05:37] reduced but not gone?
[10:06:09] the last one is from 5 minutes ago (host not reachable) :|
[10:06:27] from mw1337, I will check if I can connect from there
[10:07:23] so a simple telnet works, so FW or network issues are not the problem
[10:08:07] it is also coming from different mw hosts
[10:10:39] so I will do the reverse test, we should have enough time to see past errors
[10:11:16] yeah
[10:12:11] sorry to drag you into this, but 10.1 compatibility- if it is that, it is quite an important issue to debug
[10:12:36] and technically it was causing errors, just not a lot of real impact
[10:13:07] don't be sorry, this is very important and I actually pooled db1114 so it is technically my fault :)
[10:13:17] no, it is not
[10:13:35] maybe we can also try with db1063
[10:13:41] and give it some main traffic weight
[10:13:46] and see if there are errors
[10:13:51] if you remember, we both checked it after riccardo pinged us
[10:21:59] if you keep an eye on mediawiki errors and logs
[10:22:13] I will check the metrics to see if there is any reason for it to be slower
[10:22:20] or network errors
[10:23:37] yep
[10:23:42] will do that
[10:23:58] there was an error at 10:20
[10:24:06] but you synced at .22
[10:24:10] so let's discard that one for now
[10:26:40] I can see the connection problems https://grafana.wikimedia.org/dashboard/db/mysql?panelId=10&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104&from=now-24h&to=now
[10:27:12] nothing on logstash so far
[10:27:24] 3.5 may not seem like a high value, but it is in failures/second
[10:27:48] that would be 200 errors in a minute
[10:29:04] I am going to compare it to db1080, which should be almost the same
[10:29:56] the number of bytes written is much higher
[10:30:21] 10-20MB/s
[10:30:42] the iops are also much higher
[10:30:59] marostegui: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104&from=now-24h&to=now
[10:31:01] vs
[10:31:10] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1080&var-port=9104
[10:31:45] even if there are query differences, writes should be similar
[10:31:52] that is pretty strange
[10:32:09] so it is not just me, that smells bad?
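The comparison above is being done by eyeballing two Grafana panels; the same check can be scripted against the Prometheus backend those dashboards read from. A minimal sketch, assuming a placeholder Prometheus endpoint, the `host:9100` instance label convention, and the node_exporter metric name `node_disk_written_bytes_total` (older exporter versions call it `node_disk_bytes_written`):
```
#!/bin/bash
# Compare average disk write throughput (bytes/s) over the last 24h for two
# hosts via the Prometheus HTTP API. Sketch only: the endpoint below is a
# placeholder and the metric name depends on the node_exporter version.
PROM="http://prometheus.example.internal/ops/api/v1/query"   # hypothetical endpoint

for host in db1114 db1080; do
  # rate() over a 24h window gives the per-second average for that window;
  # sum() folds together all block devices on the host.
  query="sum(rate(node_disk_written_bytes_total{instance=\"${host}:9100\"}[24h]))"
  bytes=$(curl -s -G "$PROM" --data-urlencode "query=${query}" \
          | jq -r '.data.result[0].value[1] // "no data"')
  printf '%s: %s bytes/s written (24h average)\n' "$host" "$bytes"
done
```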
[10:32:24] let's compare it to other 10.1 replicas on enwiki
[10:32:38] maybe we should wait a bit for the traffic to stabilize after your last change
[10:32:51] but that is a 24-hour graph
[10:33:04] it has been high for a long time
[10:33:12] that is a 24h vs a 2-day graph
[10:33:13] but yeah
[10:33:21] maybe we should reimage the host with 10.0
[10:33:23] well, you get the idea
[10:33:25] and see if the behaviour is the same
[10:34:03] I am going to depool it for sure to run some tests
[10:34:23] +1
[10:34:23] will we be ok with only 2 api hosts?
[10:34:43] could this be a kernel issue (latest patch?)
[10:35:28] I would say we can reimage with 10.0
[10:35:31] mmm
[10:35:32] to discard
[10:35:39] writes are high for db1099, too
[10:35:45] but it has probably less impact
[10:35:58] could be 10.1
[10:37:16] another thing I see is an increase in writes since yesterday at 13:24
[10:37:27] could some depool have happened at that time, which would be normal?
[10:38:27] or you know, a wikidata deployment XD
[10:38:36] haha
[10:38:43] Switch all refreshLinks jobs to EventBus, file #2 - T185052 (duration: 01m 15s)
[10:38:44] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052
[10:38:50] I would suggest we reimage and try 10.0 to discard or confirm issues at least
[10:38:52] ^interesting
[10:38:58] no, too soon
[10:38:58] reimaging + cloning will take around 2-3 hours
[10:39:14] let me depool this, and you can continue your work
[10:39:23] it is not worth spending two people on this
[10:39:37] I can take care of it
[10:39:40] seriously
[10:39:46] I pooled it in the first place
[10:39:47] unless of course I cannot find anything and will ask you to take over
[10:39:55] no, I like this, let me do it
[10:39:59] ok :)
[10:40:00] I found it
[10:41:26] ????
[10:43:22] sorry
[10:43:32] I found the increase in load, not sure if that is the source
[10:43:44] see my conversation with joe
[10:44:09] yeah, just read it
[11:10:41] 10Blocked-on-schema-change, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, 10Wikidata, 10Schema-change: Deploy schema change for adding numeric primary key to wbqc_constraints table - https://phabricator.wikimedia.org/T189101#4034752 (10Lucas_Werkmeister_WMDE)
[12:01:32] I am going for lunch, after that I will review the steps for the m5 failover for today at 15:30 UTC: https://phabricator.wikimedia.org/T189005#4032482
[12:02:14] ok, I will be back at 13, we can do it together
[12:02:38] sure, we have to wait for andrew too :)
[12:03:28] the review, I mean
[12:03:35] sure :)
[12:03:37] thanks!
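Not the actual switchover procedure used for T189005, just a minimal sketch of the kind of pre-checks a master failover review usually covers (replication caught up, read_only state, binlog positions). The host names come from the task; the client auth setup and the check list itself are assumptions:
```
#!/bin/bash
# Generic pre-failover sanity checks for promoting db1073 over db1009 (m5).
# Sketch only: assumes working client credentials on the controlling host
# and does not perform the actual topology change.
OLD_MASTER=db1009.eqiad.wmnet
NEW_MASTER=db1073.eqiad.wmnet

# 1. The candidate master must be replicating and caught up.
mysql -h "$NEW_MASTER" -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

# 2. read_only flags: the old master should go read_only before writes move,
#    and the new one should only be opened up afterwards.
for h in "$OLD_MASTER" "$NEW_MASTER"; do
  echo -n "$h read_only = "
  mysql -h "$h" -N -e "SELECT @@global.read_only"
done

# 3. Binlog coordinates on both sides, to verify nothing is still being
#    written to the old master before repointing replicas and proxies.
mysql -h "$OLD_MASTER" -e "SHOW MASTER STATUS\G"
mysql -h "$NEW_MASTER" -e "SHOW MASTER STATUS\G"
```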
[12:12:38] scheduled jobs: Incremental Backup 10 09-Mar-18 04:05 es2001.codfw.wmnet-Monthly-1st-Mon-production-mysql-srv-backups-latest production00
[12:13:16] that is actually perfect timing
[12:37:54] so, there are 2 issues, but I think one is affecting the other- db1114 is writing twice as much to disk as db1080
[12:38:52] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034945 (10jcrespo)
[12:39:11] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034956 (10jcrespo) p:05Triage>03High
[12:39:19] on top of the 2-3x increase because of the ongoing issue T189204
[12:39:19] T189204: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204
[13:20:01] but why would db1114 be writing twice...
[13:20:04] weird
[13:20:09] it cannot be 10.1
[13:20:13] Or I cannot believe it :)
[13:25:29] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035008 (10Marostegui)
[13:45:00] however, the other 10.1 hosts are writing that much, if not more (because they have 2 instances) but they are not as "loaded"
[13:45:26] could it be compression?
[13:45:33] it is not compressed
[13:45:39] db1114 isn't compressed I mean
[13:46:34] but db1080 isn't either
[13:46:38] so not that
[13:46:46] I was checking all the disks and everything
[13:46:49] and HW-wise it looks good
[13:47:11] while it could be hw, we wouldn't see a difference on iops
[13:47:19] well, we would, just not on more iops
[13:47:24] true
[13:47:49] the amount of writes in bytes should be around the same
[13:47:56] what about kernels?
[13:48:10] the rc slaves should have the same I think
[13:48:12] let's see
[13:48:22] db1114 has the latest
[13:48:48] we could try to upgrade one of the rc slaves
[13:48:50] to the latest
[13:48:59] I would actually do the opposite
[13:49:03] a test downgrade
[13:50:14] moritzm: we want to discard the latest kernel as the possible cause of io issues
[13:50:35] is it ok to downgrade 1 host as a test?
[13:50:50] sure
[13:50:53] we could go for 4.9.65-3+deb9u1 which is installed
[13:51:06] is that about https://phabricator.wikimedia.org/T189204?
[13:51:17] kind of
[13:51:21] not really, it is on top of that
[13:51:38] that is causing more issues on 1 host, which happens to be the one upgraded
[13:51:49] 1114?
[13:51:52] yes
[13:51:53] yes
[13:51:58] the issue exists aside from that
[13:52:16] for 1114 the connection failure is much higher
[13:52:26] what did it run before, do we know that?
[13:52:32] it is a new host
[13:52:35] it was not in production
[13:52:37] ah, ok
[13:52:45] but we are comparing it with another of the exact same load
[13:52:56] io pressure seems double
[13:53:09] sure, feel free to downgrade and let me know if there's anything I can help with
[13:53:16] it is unlikely it is the kernel, but it is the easiest to test
[13:53:26] 4.9.65-3+deb9u1 is installed, we can simply reboot and boot with that one
[13:53:36] more than downgrading mariadb
[13:53:59] as I said, marostegui, don't worry about this for now
[13:54:08] I will keep you updated
[13:54:16] thank you :)
[13:54:22] I prefer we work on different stuff, like you with the failover
[13:54:37] sounds good!
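A sketch of how the kernel test just discussed might be staged on a Debian stretch host: confirm which 4.9 images are installed and which is running, then select the older image for the next boot only. The GRUB menu entry string is an assumption and has to be checked against the host's own grub.cfg:
```
#!/bin/bash
# Check installed vs running kernels and schedule a one-off boot into the
# older image. Sketch only: the GRUB entry name below is a placeholder.
dpkg -l 'linux-image-4.9*' | awk '/^ii/ {print $2, $3}'
uname -r    # kernel currently running

# List the boot entries GRUB actually knows about.
awk -F\' '/menuentry |submenu / {print $2}' /boot/grub/grub.cfg

# Boot the older kernel once; the default is restored on the following boot.
# The entry string must match one printed above (placeholder shown here).
sudo grub-reboot 'Advanced options for Debian GNU/Linux>Debian GNU/Linux, with Linux 4.9.0-4-amd64'
sudo reboot
```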
[13:54:42] yep, 4.9.65-3+deb9u1 is pre-Meltdown and pre-Spectre
[13:54:44] don't worry, knowing me, I will bother you more
[13:54:49] hahaha
[13:54:51] :-)
[14:06:58] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035074 (10Marostegui)
[15:00:50] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035219 (10Marostegui)
[15:04:19] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035238 (10Marostegui)
[15:15:30] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035288 (10Andrew)
[15:22:27] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035305 (10Andrew)
[15:27:30] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035323 (10Marostegui)
[15:28:11] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027038 (10Marostegui)
[15:30:26] 10DBA, 10Analytics, 10EventBus, 10MediaWiki-Database, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4035332 (10mobrovac) After [lowering the concurrency](https://gerrit.wikimedia.org/r/#/c/417270/) the number of new connections slightly...
[15:31:04] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035335 (10Andrew)
[15:32:23] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035337 (10Marostegui)
[15:33:13] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027038 (10Marostegui)
[15:34:41] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035345 (10Marostegui)
[15:38:11] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035352 (10Marostegui)
[15:39:13] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035354 (10Andrew)
[16:02:18] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035401 (10chasemp) In real time it was realized the ACL from labs-hosts VLAN was blocking access to the new m5 backing DB. > commit comment "T189005 nova...
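The VLAN ACL problem above was spotted in real time; a small sketch of the client-side check that catches it, in the spirit of the telnet tests earlier in the day. The client host is hypothetical, the target is the new m5 master from the task:
```
#!/bin/bash
# Reachability check of the new m5 master on the MySQL port, to tell a
# VLAN ACL / firewall block apart from a grants problem. Sketch only.
TARGET=db1073.eqiad.wmnet
PORT=3306

# TCP-level check, no credentials needed; a 5s timeout so a silent DROP is
# reported quickly instead of hanging.
if timeout 5 bash -c "cat < /dev/null > /dev/tcp/${TARGET}/${PORT}"; then
  echo "TCP ${TARGET}:${PORT} reachable"
else
  echo "TCP ${TARGET}:${PORT} NOT reachable (firewall/ACL?)"
fi

# Application-level check: a refused handshake vs. an access-denied error
# separates network ACLs from MySQL permissions.
mysql --connect-timeout=5 -h "$TARGET" -e "SELECT 1" || echo "MySQL-level connection failed"
```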
[16:07:56] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035404 (10Andrew)
[16:11:31] 10DBA: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10Marostegui) p:05Triage>03Normal
[16:12:50] 10DBA: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035423 (10Marostegui)
[16:12:57] how did things go?
[16:13:16] are things still happening now?
[16:14:12] it all went fine
[16:15:53] is it over, can I help?
[16:16:12] I was in a meeting and forgot about it
[16:17:22] Don't worry
[16:17:26] it is all done now :)
[16:17:49] ok, thanks, good job!
[16:20:46] did you know that db1009 had an uptime of 1028 days, making it the second-longest running mysql instance?
[16:21:06] Yeah, I don't want to decommission it :-(
[16:21:28] it is not that old, actually, it was the first server I set up on my own
[16:21:39] I think you should decommission it then
[16:21:44] ha ha
[16:22:19] es1019 now has the new kernel btw, and 10.0.34, it had to be stopped to fix the idrac, so I upgraded it
[16:22:27] just saying because I think it is the first es server that will have it
[16:22:31] in case we notice issues
[16:22:38] https://phabricator.wikimedia.org/T98958
[16:22:57] my suspicion is either kernel or stretch or 10.1
[16:23:05] T98958 -> and we are now around T189216
[16:23:05] T189216: Decommission db1009 - https://phabricator.wikimedia.org/T189216
[16:23:05] T98958: Reinstall db1009.eqiad from zero - https://phabricator.wikimedia.org/T98958
[16:23:07] probably kernel
[16:23:30] marostegui: when I joined, phabricator was actually just installed
[16:24:03] what was there before?
[16:24:17] https://phabricator.wikimedia.org/T38
[16:24:30] hahaha T38
[16:24:30] T38: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38
[16:24:30] haha
[16:24:39] https://phabricator.wikimedia.org/T1
[16:24:40] haha
[16:24:48] rt was actually only used by ops, I think
[16:24:57] bugzilla was used for bugs
[16:25:08] I used rt during one of my clinic duties actually
[16:25:22] and lots of other software, one per team
[16:25:48] actually, T1 is not the oldest
[16:25:49] T1: Get puppet runs into logstash - https://phabricator.wikimedia.org/T1
[16:26:02] bugzilla tickets were imported starting from T1000
[16:26:02] T1000: Update Beta Cluster status documentation (re Q3 intradepartamental priority) - https://phabricator.wikimedia.org/T1000
[16:26:08] or T10000
[16:26:08] T10000: Move @access public/private/protected qualifiers to PHP 5 keywords - https://phabricator.wikimedia.org/T10000
[16:26:15] actually https://phabricator.wikimedia.org/T2001 I believe
[16:26:21] ah, thanks
[16:26:25] since I did it :)
[16:26:28] cool
[16:26:32] haha
[16:26:53] bugzilla…pfff I have used it loads and I prefer phabricator
[16:26:58] yeah, it fits 2004
[16:27:02] the year
[16:27:04] Although I have to say that at the start phabricator is quite overwhelming
[16:27:33] oh, after a while, it is actually very cool
[16:27:39] yeah, now I really like it
[16:28:04] I think it is only the CSS changes that confuse you
[16:28:11] we collapsed 6+ tracking systems into phab at the time
[16:28:16] the biggest two were rt and bugzilla
[16:28:34] chasemp: was rt used by non-sysops?
[16:28:48] it wasn't, and very few had access even
[16:28:49] marostegui: do you think rt is any good?
[16:28:55] wmf-nda was originally just a group for accessing old RT things
[16:29:03] because oh my god
[16:29:07] the interface
[16:29:13] jynus: I used it like 10 years ago, don't know if it has changed, but 10 years ago, I absolutely disliked it
[16:29:22] it really hasn't changed :D
[16:29:27] haha
[16:29:29] I don't think it is about disliking it
[16:29:33] then my comment still stands
[16:29:34] I do not like gerrit
[16:29:39] same
[16:29:43] but after a while, it makes sense
[16:29:56] it is ugly, but functional
[16:30:01] at least the old skin
[16:30:08] I haven't tested the new one
[16:30:18] but rt is the opposite of usability
[16:30:20] RT's underbelly of admin config was also nightmare-ish aside from the normal user interface
[16:30:27] except for emailing
[16:30:49] does anyone know if otrs is any better
[16:30:52] ?
[16:31:06] it's similar in that it's an MTA task tracking system but it's really made differently
[16:31:17] phab is for a known-quantity group of tech folks collaborating
[16:31:20] yes, I would put rt in that group
[16:31:24] otrs is more for anon reports and external bug reports
[16:31:28] not for task handling
[16:31:53] phab upstream has had an otrs-style accept-anon-email-reports app in the works for a long time for their own purposes
[16:31:59] i.e. they run phab as a service
[16:32:20] it's just a very different workflow where the identity of the parties involved has to be handled differently
[16:32:22] have any of you used jira?
[16:32:28] yeah for years
[16:32:33] and that whole ecosystem
[16:32:36] oh, speaking of jira, I have to answer mariadb XD
[16:32:44] Jira for me is horrible, I really don't like it
[16:32:58] I was trying to remember what percona's was
[16:33:01] jira is also a damn nightmare to administer
[16:33:12] it was also mainly email-handling and customer-facing
[16:34:13] it was okish
[16:34:21] I had the same opinion
[16:34:30] the integration w/ confluence was nice (their wiki)
[16:34:36] for emails, my #1 request is to be clear between a comment
[16:34:38] but many things about jira are whacky
[16:34:42] and an email response
[16:35:03] so many problems can come from not knowing the recipients
[16:35:52] with our phabricator, that is not a problem, everything is public :-)
[16:36:22] the jira interface is a nightmare
[16:36:29] marostegui: one request, before fully decomming db1009
[16:36:40] can you make a temporary archive of it?
[16:36:50] yeah, I was doing it right now :)
[16:37:00] oh, you are always 1 step ahead!
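One way the pre-decommission archive might look, consistent with the destination later recorded on the task (db1113:/srv/db1009_backup). This is a sketch of a cold copy, not the exact commands that were run; paths, service name, compression and transfer method are all assumptions:
```
#!/bin/bash
# Cold archive of db1009's datadir before decommissioning, streamed to
# db1113:/srv/db1009_backup. Sketch only.
set -e
SRC=/srv/sqldata                  # assumed datadir location
DEST_HOST=db1113.eqiad.wmnet
DEST_DIR=/srv/db1009_backup

# Stop MySQL so the files are consistent; the host is being decommissioned,
# so it no longer needs to keep serving.
systemctl stop mariadb

# Stream a compressed tarball straight to the destination host.
ssh "$DEST_HOST" "mkdir -p $DEST_DIR"
tar -C "$SRC" -cf - . | pigz -c | ssh "$DEST_HOST" "cat > $DEST_DIR/db1009-sqldata.tar.gz"

# Keep the config alongside the data for reference.
scp /etc/mysql/my.cnf "$DEST_HOST:$DEST_DIR/db1009-my.cnf"
```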
[16:37:02] same thing I did with the old tendril host
[16:37:02] :-)
[16:37:10] jynus: basically I don't trust myself :)
[16:37:19] oh, for tendril I don't care
[16:37:32] because technically no valuable data is there
[17:02:36] 10DBA, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035633 (10Marostegui) A backup of this host is placed at: `db1113:/srv/db1009_backup`
[17:03:12] about to reboot db1114
[17:03:55] cool
[17:04:31] 10DBA, 10Operations, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035640 (10Marostegui)
[17:07:27] we should check if the kernel that db1114 has is running somewhere else, because I have been upgrading many kernels, but mostly on jessie, but some hosts might have it (although, on jessie)
[17:08:16] yeah, that is why I wanted to try this first
[17:12:36] apparently, actually, the latest kernel wasn't installed yet
[17:12:39] it was -5
[17:12:58] I am now on -4
[17:13:25] * marostegui crosses his fingers
[17:14:29] it will get more io after reboot, so it will take hours to see the difference
[17:14:38] yeah
[17:14:48] probably tomorrow morning we will see it clearly
[17:16:25] it is logging in row format, not sure if that is the default or not
[17:17:06] it is for all the slaves but the candidate masters, no?
[17:17:07] it is
[17:17:13] so no difference there
[17:17:22] no, only db1067 has statement on s1
[17:17:40] I was just trying to cover all possibilities
[17:17:53] of course :)
[17:18:28] it has cached up, so I expect now to have more reads than usual
[17:18:40] caught up
[17:18:55] but it should have mostly the same writes
[17:20:02] strange, I see a decrease already
[17:20:18] really?
[17:20:30] if it is confirmed, I would reboot again with the other kernel, to confirm it
[17:20:33] maybe on monday
[17:20:36] yes
[17:20:51] I will wait until tomorrow
[17:21:01] there could be things altering the graphs
[17:21:08] yep
[17:21:19] but it is suspicious already
[17:21:27] it could be caused by the reboot
[17:21:38] some process logging quickly or something
[17:23:00] half as much io write in bytes: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=33&fullscreen&orgId=1&from=1520526165150&to=1520529765150&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104
[17:23:19] or the same as the other hosts, basically
[17:23:23] that huge spike is the mysql stop, no?
[17:23:29] yes
[17:23:45] and the small one is the mysql start
[17:24:14] let me see the swapping
[17:25:46] swap 0, so it wasn't that
[17:27:53] I will repool it later, maybe it is misleading without load
[17:28:05] yeah
[17:28:07] good test
[17:41:19] 10DBA, 10Patch-For-Review: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4035723 (10Marostegui)
[17:41:21] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4035724 (10Marostegui)
[22:52:47] 10DBA, 10Wikimedia-Site-requests: Some slaves are impossible to check for replication lag in MediaWiki - https://phabricator.wikimedia.org/T189263#4036935 (10MaxSem)
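T189263 above is about MediaWiki not being able to check some replicas for lag. A minimal sketch of the two usual ways lag is measured on a replica: Seconds_Behind_Master from SHOW SLAVE STATUS, and a pt-heartbeat style heartbeat table like the one MediaWiki's heartbeat-based lag detection reads. The heartbeat schema and timestamp format below follow pt-heartbeat defaults and should be treated as assumptions:
```
#!/bin/bash
# Two lag checks against a replica, e.g. db1114. Sketch only: assumes working
# client auth and a pt-heartbeat style heartbeat.heartbeat table.
HOST=db1114.eqiad.wmnet

# 1. What the replica itself reports.
mysql -h "$HOST" -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_.*Error'

# 2. Heartbeat-based lag: difference between now and the newest timestamp
#    written on the master by pt-heartbeat. pt-heartbeat stores ts as an
#    ISO-8601 string, hence the REPLACE() of the 'T' separator.
mysql -h "$HOST" -N -e "
  SELECT server_id,
         TIMESTAMPDIFF(MICROSECOND,
                       REPLACE(ts, 'T', ' '),
                       UTC_TIMESTAMP(6)) / 1e6 AS lag_seconds
  FROM heartbeat.heartbeat
  ORDER BY ts DESC
  LIMIT 1;"
```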