[00:11:27] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team, 10Availability, 10Performance-Team (Radar): wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#1200679 (10Krinkle)
[01:23:47] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4159963 (10Peachey88)
[05:19:51] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4160111 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] dbstore...
[05:19:57] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160112 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [...
[05:20:09] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4160113 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] dbstore1002 [...
[05:20:39] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4160115 (10Marostegui)
[05:20:57] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160116 (10Marostegui)
[05:21:18] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4160117 (10Marostegui)
[06:24:27] -rw-rw---- 1 mysql mysql 82G Apr 26 06:21 logging.ibd
[06:24:35] vs -rw-rw---- 1 mysql mysql 217G Apr 25 10:20 logging.ibd
[06:25:24] I think purging got accelerated since alter finished
[06:25:36] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=11&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13318&from=1524551131210&to=1524723931210
[06:30:08] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160185 (10jcrespo) Please sync with me on s2, as db1054 is likely to be decomm'ed very soon, an...
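The logging.ibd shrink discussed above can also be watched from inside MySQL rather than on the filesystem. A minimal sketch, assuming information_schema access on the replica; the schema name below is illustrative (dbstore2001:3318 is an s8 instance, so wikidatawiki is assumed):

```sql
-- Approximate footprint of the logging table as InnoDB sees it:
-- data + indexes in use, plus free space still held inside the
-- tablespace after the ALTER and the accelerated purging.
SELECT table_schema,
       table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 1) AS used_gb,
       ROUND(data_free / POW(1024, 3), 1)                    AS free_gb
  FROM information_schema.tables
 WHERE table_schema = 'wikidatawiki'   -- illustrative schema name
   AND table_name   = 'logging';
```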
[06:48:31] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160200 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wm...
[06:48:41] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4160203 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wm...
[06:49:57] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4160207 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wmflabs are...
[06:50:40] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4160210 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eq...
[06:56:20] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160214 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02...
[06:56:23] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4160215 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02...
[06:56:44] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4160216 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02 ~ $ mysq...
[06:56:46] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4160217 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bast...
[07:53:42] jynus: if I start with the rc slaves in s2 that'd be fine with you?
[07:53:53] yes
[07:54:01] I have depooled db1090, though
[07:54:14] sure, won't start till the evening probably
[07:54:16] just moved it to multi-instance
[07:54:22] I will ping you before starting
[07:54:24] I was actually suggesting to do that first
[07:54:42] so I can pool it as soon as it finishes
[07:54:53] to pool which one?
[07:55:01] db1090
[07:55:05] it is depooled
[07:55:11] So you want me to do that first?
[07:55:17] yes, if you can
[07:55:19] sure
[07:55:32] I guess it will take 2-3 hours (I guess)
[07:55:45] it is ok
[07:55:53] but I prefer it to having 2 hosts depooled
[07:56:11] sure, ok, going to run it
[07:56:25] meanwhile I can work with db1070:s7
[07:56:31] yep
[07:56:31] *db1090:s7
[07:56:48] db1090:s7 ?
[07:57:08] does it bother you?
[07:57:14] no, I am not touching s7
[07:57:34] I will not touch s2, but I can load s7 at the same time
[07:57:40] sure
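For context on the alter being scheduled on the s2 rc slaves above, T191519 is an index change on recentchanges. A rough sketch of its shape, applied host by host on depooled replicas; the exact index and column list live in the task and the MediaWiki schema patch, so treat this as illustrative only:

```sql
-- Hypothetical shape of the T191519 index change; the authoritative
-- definition is in the Phabricator task, not here.
ALTER TABLE recentchanges
  ADD INDEX rc_namespace_title_timestamp (rc_namespace, rc_title, rc_timestamp);
```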
[07:58:46] then there is db1060, which has to disappear
[07:59:04] so let me know what do you prefer to do?
[07:59:22] probably move it to x1
[07:59:27] no
[07:59:27] so we can unblock the switch thingy
[07:59:29] I mean
[07:59:36] db1074 convert to row?
[07:59:43] and move sanitariums there?
[08:00:07] db1060 has to literally go
[08:00:29] db1069 can go to x1
[08:00:45] Yeah, db1074 to row (and multi-instance?)
[08:00:57] why multiinstance?
[08:01:08] db1060 is vslow, which vslow will you place there?
[08:01:11] it is an api/main
[08:01:18] db1090
[08:01:21] :s2
[08:01:31] ah, ok
[08:01:41] then yeah, db1074 as sanitarium master + db1090 as vslow sounds good
[08:02:04] We could also use db1090 as sanitarium master
[08:02:31] next I was going to depool db1069 and load it into db1090:s7
[08:03:55] sounds good
[08:04:06] note db1054 has to go, too
[08:05:40] yeah
[08:06:12] let me do the things we know are ok to do (get db1069)
[08:06:26] and then we can scratch a full plan
[08:06:28] sure
[08:06:30] https://gerrit.wikimedia.org/r/#/c/429150/
[08:06:31] can you check it?
[08:06:43] 10DBA, 10Dumps-Generation: Some dump hosts are accessing main traffic servers - https://phabricator.wikimedia.org/T143870#4160278 (10ArielGlenn) As far as I knew, this was a wikidata-only issue, though I have periodically checked this task to see if there is any more information. The zhwiki task, so that we h...
[08:07:05] why pool it?
[08:07:11] it shouldn't be a mediawiki host
[08:07:25] Yeah, but it is replicating s1, and I use the config files to check the alter tables done and pending
[08:07:45] it shouldn't be known to mediawiki
[08:08:07] it will be a core host in the future, so there is no harm in adding it now
[08:08:21] I disagree
[08:08:39] what's the harm in adding it now?
[08:09:06] what is the harm on adding labsdb1009?
[08:09:19] labsdb1009 will never be a core host
[08:09:40] so will not db1116, unless reimaged
[08:09:57] db1116 will be as soon as we get the sanitarium definitive hosts
[08:10:09] but it will have to be reimaged
[08:10:15] yes
[08:10:31] why do you want to add it, it is not a mediawiki host?
[08:10:52] only core hosts get there
[08:10:56] ok, fine
[08:11:31] you should use dbhosts to control dbs pending to be reimaged
[08:11:53] ok, I have abandoned it. This is not that important to spend more time discussing about it
[08:20:48] the whole point of etcd is to make mediawiki config not the source of truth, but took it away from it
[08:21:05] we want less things on those files, not more
[08:39:16] I don't know what I am doing https://gerrit.wikimedia.org/r/#/c/429153/
[08:40:31] that looks good
[08:41:10] but db1069 is also the alternative master
[08:41:15] so not sure it is ok to do that
[08:42:00] when it is moved to x1, how should we leave s7?
[08:42:13] assuming I just add db1090:s7 there?
[08:43:03] We can have db1090 as sanitarium master and db1079 as candidate master
[08:43:13] it is ok to leave a candidate master down for a long time?
[08:43:26] should we convert some other host to statement beforehand?
[08:43:27] I would think so
[08:43:59] but 79 is row because sanitarium
[08:44:06] this is a headache
[08:44:27] then maybe db1086 should be the new candidate master
[08:44:42] it is in a different row and rack
[08:45:10] so, probably I should depool it first, move it to statement
[08:45:15] and then do the above commit
[08:45:37] yeah, let's do that
[09:59:10] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4160437 (10Marostegui)
[09:59:40] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3396309 (10Marostegui)
[10:04:28] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160440 (10MarcoAurelio) Confirmed. Happening to me as well for all of the newly created wikis.
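The broken-new-wiki reports above boil down to the public views not answering on the analytics/web replicas. A minimal sketch of the check, once connected to one of those replicas; inhwiki is used as an illustrative wiki name:

```sql
-- The public views for a new wiki live in a <dbname>_p schema on the
-- cloud replicas; if the view/role step was skipped, the database (or
-- the grants on it) will simply not be visible.
SHOW DATABASES LIKE 'inhwiki\_p';
SELECT COUNT(*) AS pages FROM inhwiki_p.page;
```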
[10:30:42] manuel you don't connect?
[10:31:09] ah the meeting
[10:31:14] sorry, I forgot
[11:14:50] jynus: I am done with db1090:3312 you can repool it whenever you like
[11:15:35] thanks
[11:15:43] I will take care of all of that
[11:15:50] let me know which one I can do next (no rush)
[11:15:53] I am going for lunch!
[11:16:02] I have yet to work with db1090:3317
[11:17:15] Sure, no problem. I won't touch any host in s2 unless you give me green light :)
[11:17:26] I will reimage as stretch db1086 actually
[11:17:37] all other hosts gave no errors
[11:18:31] maybe I am over-thinking, but I would do a quick compare.py run on db1060 and db1066
[11:35:03] jynus / marostegui https://phabricator.wikimedia.org/T184375#4160200 that you can fix?
[12:22:59] That's for Clouds team probably to fix
[12:23:16] Let me check the roles and all that
[12:35:33] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4161031 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table in...
[12:36:07] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4161034 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading ta...
[12:36:51] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4161039 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table in...
[12:37:23] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4161043 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table information...
[12:38:42] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): `pr_index` to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#4161058 (10Marostegui)
[12:45:19] jynus: let me know if I can start with db1103 or db1105 (rc slaves in s2)
[12:46:06] oh, the rest of s2 can be done without telling me
[12:46:14] just don't do db1060
[12:46:21] ah cool!
[12:46:23] because it will be decommissioned
[12:46:25] I will start with rc slaves then :)
[12:46:26] thanks
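compare.py is the in-house consistency checker mentioned above; without relying on its exact interface, the same kind of spot-check between db1060 and db1066 can be done by running an identical bounded checksum on both hosts and diffing the output. Table, columns and id range below are purely illustrative:

```sql
-- Run the same query on both hosts over the same primary-key range;
-- differing row counts or checksums point at drifted data.
SELECT COUNT(*) AS rows_in_range,
       BIT_XOR(CRC32(CONCAT_WS('#', rc_id, rc_timestamp, rc_namespace, rc_title))) AS chunk_checksum
  FROM recentchanges
 WHERE rc_id BETWEEN 100000 AND 200000;
```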
[13:01:34] FYI I've reset the Force PXE from all hosts where it was set, including a bunch of DBs, and es1019 has *again* IPMI issues: T193155
[13:01:34] T193155: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155
[13:02:44] again?
[13:04:01] according to T187530, T155691 and T167121
[13:04:02] T187530: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530
[13:04:02] T155691: es1019.eqiad.wmnet drac unresponsive - https://phabricator.wikimedia.org/T155691
[13:04:02] T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121
[13:04:06] volans: we have a failsafe, BTW for the databases
[13:04:17] no, I know it had those in the past
[13:04:30] I was saying again as in, why an extra time?
[13:05:00] default partman recipe fails so we require a puppet change to reimage a db host
[13:05:03] again compared to the previous ones, is not normal that the IPMI breaks so easily and often
[13:05:36] we did that after we lost, I think es1019 accidentally
[13:06:17] yeah I know
[13:06:43] and probably not a bad policy for stateful services
[13:18:19] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4161161 (10Urbanecm) 05Resolved>03Open It do not work correctly. ``` urbanecm@tools-bastion-03 ~ $ sql hiwiki...
[13:31:13] ^ I will get that one fixed too
[13:31:50] role?
[13:32:41] yep
[13:33:08] we should tell cloud to document it after creation
[13:33:21] probably if it is done before, it fails because database doesn't exist
[13:33:25] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4161182 (10Marostegui) 05Open>03Resolved Should be fixed now ``` marostegui@tools-bastion-03:~$ sql --cluster we...
[13:33:29] yep, I was going to ping brooke, as she was the one taking care of those tasks
[13:34:20] I did STOP SLAVE; SET GLOBAL binlog_format=STATEMENT; FLUSH BINARY LOGS; FLUSH RELAY LOGS; START SLAVE;
[13:34:25] on db1086
[13:34:41] and worked?
[13:34:54] it is difficult to see if it was on mixed
[13:36:26] not sure what to check to verify it other than "there is no row based records on the binlog"
[13:37:13] I was mistaken in our previous meeting, switch for row D wasn't migrated yet
[13:37:16] It was, but on codfw
[13:37:19] lol
[13:37:24] that's what confused me :)
[13:37:46] let's stick to the plan of upgrading first C, as it is easier
[13:37:50] yes
[13:38:28] all events on older logs seem to be on row, none on the new one
[13:38:33] I think that should be enough
[13:38:48] for "it doesn't apparently break"
[13:38:59] in theory it should be fine, but we all know the theory..
[13:39:06] I guess if it was writing actively
[13:39:11] ongoing sessions
[13:39:16] it would take more time
[13:39:24] e.g. an active master
[13:39:45] yeah, that'd be a bit more scary to do
[13:39:47] it is not scheduled to be a master, so
[13:39:57] we can trust it until something is strange
[13:40:34] I think testing it is better than doing things irrationally
[13:40:47] as long as it is a safe test
[13:41:08] I was going to reimage it, but then realized it is a candidate master, so it will be the last set of servers to reimage
[13:41:17] so repooling it
[13:46:15] oki
[13:48:43] this is going to be a 5-ball game
[13:49:12] db1122 -> db1090 -> db1089 -> db1069 -> db1056
[13:49:25] db1089?
[13:49:30] you mean db1086?
[13:49:38] yeah, I am confusing it all the time
[13:49:48] and in the end I will create an outage, you'll see
[13:50:01] haha
[13:50:11] I know db1089 because I always use it first in s1
[13:50:13] XD
[13:50:14] I am doing commits like https://gerrit.wikimedia.org/r/#/c/429182/
[13:50:24] so I may go away today soon
[13:50:34] hahaha
[13:50:46] get some more cocacola!
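One way to make the "no row based records on the binlog" verification above concrete: after the FLUSH BINARY LOGS on db1086, list the events in the newest binlog and look for row-image event types. A minimal sketch; the binlog file name is a placeholder:

```sql
-- List the binlogs to find the one created by FLUSH BINARY LOGS.
SHOW BINARY LOGS;
-- With binlog_format=STATEMENT in effect for new sessions, the new log
-- should contain no Write_rows / Update_rows / Delete_rows events,
-- only Query events carrying the original statements.
SHOW BINLOG EVENTS IN 'db1086-bin.002345' LIMIT 200;  -- placeholder file name
```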
[13:56:46] since atop was disabled, very, very few connection errors on databases
[13:57:52] \o/
[16:45:47] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for idwikimedia - https://phabricator.wikimedia.org/T193187#4161962 (10Urbanecm)
[16:49:15] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4161987 (10jcrespo) Last update about productionization: * db1090 should be able to be repooled soon and so remove db1060 (decom) and db1069 (x1) from production, but I left it being compressed on 2 local screen...
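The compression jcrespo mentions leaving running in screen sessions is, in broad strokes, per-table InnoDB compression. A minimal sketch of the statement involved, assuming innodb_file_per_table and the Barracuda file format; the table name and block size are illustrative, the real runs iterate over the large tables of each section:

```sql
-- Rebuilds the table with compressed 8 KiB pages; on tables the size
-- of revision or logging this runs for hours, hence the screen sessions.
ALTER TABLE revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```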