[00:11:27] 10DBA, 10MediaWiki-Database, 10MediaWiki-Platform-Team, 10Availability, 10Performance-Team (Radar): wfWaitForSlaves in JobRunner can massively slow down run rate if just a single slave is lagged - https://phabricator.wikimedia.org/T95799#1200679 (10Krinkle)
[01:23:47] 10DBA, 10Operations, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4159963 (10Peachey88)
[05:19:51] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4160111 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] dbstore...
[05:19:57] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160112 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [...
[05:20:09] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4160113 (10Marostegui) s2 eqiad progress: [] labsdb1009 [] labsdb1010 [] labsdb1011 [] db1102 [] dbstore1002 [...
[05:20:39] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10User-Ladsgroup, 10Wikidata-Ministry-Of-Magic: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519#4160115 (10Marostegui)
[05:20:57] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160116 (10Marostegui)
[05:21:18] 10Blocked-on-schema-change, 10DBA, 10Multi-Content-Revisions, 10Patch-For-Review, 10User-Addshore: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148#4160117 (10Marostegui)
[06:24:27] -rw-rw---- 1 mysql mysql 82G Apr 26 06:21 logging.ibd
[06:24:35] vs -rw-rw---- 1 mysql mysql 217G Apr 25 10:20 logging.ibd
[06:25:24] I think purging got accelerated since alter finished
[06:25:36] https://grafana.wikimedia.org/dashboard/db/mysql?panelId=11&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13318&from=1524551131210&to=1524723931210
[06:30:08] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10MediaWiki-Platform-Team (MWPT-Q4-Apr-Jun-2018), 10Patch-For-Review: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299#4160185 (10jcrespo) Please sync with me on s2, as db1054 is likely to be decomm'ed very soon, an...
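The logging.ibd shrink discussed above can also be watched from inside MySQL rather than on the filesystem. A minimal sketch, assuming information_schema access on the replica; the schema name below is illustrative (dbstore2001:3318 is an s8 instance, so wikidatawiki is assumed):

```sql
-- Approximate footprint of the logging table as InnoDB sees it:
-- data + indexes in use, plus free space still held inside the
-- tablespace after the ALTER and the accelerated purging.
SELECT table_schema,
       table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 1) AS used_gb,
       ROUND(data_free / POW(1024, 3), 1)                    AS free_gb
  FROM information_schema.tables
 WHERE table_schema = 'wikidatawiki'   -- illustrative schema name
   AND table_name   = 'logging';
```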
[06:48:31] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160200 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wm...
[06:48:41] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4160203 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wm...
[06:49:57] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4160207 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eqiad.wmflabs are...
[06:50:40] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4160210 (10Urbanecm) 05Resolved>03Open Something went wrong. DNS {wikicode}.analytics.db.svc.eqiad.wmflabs and {wikicode}.web.db.svc.eq...
[06:56:20] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160214 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02...
[06:56:23] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4160215 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02...
[06:56:44] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4160216 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bastion-02 ~ $ mysq...
[06:56:46] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4160217 (10Urbanecm) Just to make you sure, this isn't working on web replicas as well. See the following output. ``` urbanecm@tools-bast...
[07:53:42] jynus: if I start with the rc slaves in s2 that'd be fine with you?
[07:53:53] yes
[07:54:01] I have depooled db1090, though
[07:54:14] sure, won't start till the evening probably
[07:54:16] just moved it to multi-instance
[07:54:22] I will ping you before starting
[07:54:24] I was actually suggesting to do that first
[07:54:42] so I can pool it as soon as it finishes
[07:54:53] to pool which one?
[07:55:01] db1090
[07:55:05] it is depooled
[07:55:11] So you want me to do that first?
[07:55:17] yes, if you can
[07:55:19] sure
[07:55:32] I guess it will take 2-3 hours (I guess)
[07:55:45] it is ok
[07:55:53] but I prefer it to having 2 hosts depooled
[07:56:11] sure, ok, going to run it
[07:56:25] meanwhile I can work with db1070:s7
[07:56:31] yep
[07:56:31] *db1090:s7
[07:56:48] db1090:s7 ?
[07:57:08] does it bother you?
[07:57:14] no, I am not touching s7
[07:57:34] I will not touch s2, but I can load s7 at the same time
[07:57:40] sure
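For context on the alter being scheduled on the s2 rc slaves above, T191519 is an index change on recentchanges. A rough sketch of its shape, applied host by host on depooled replicas; the exact index and column list live in the task and the MediaWiki schema patch, so treat this as illustrative only:

```sql
-- Hypothetical shape of the T191519 index change; the authoritative
-- definition is in the Phabricator task, not here.
ALTER TABLE recentchanges
  ADD INDEX rc_namespace_title_timestamp (rc_namespace, rc_title, rc_timestamp);
```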
[07:58:46] then there is db1060, which has to disappear
[07:59:04] so let me know what do you prefer to do?
[07:59:22] probably move it to x1
[07:59:27] no
[07:59:27] so we can unblock the switch thingy
[07:59:29] I mean
[07:59:36] db1074 convert to row?
[07:59:43] and move sanitariums there?
[08:00:07] db1060 has to literally go
[08:00:29] db1069 can go to x1
[08:00:45] Yeah, db1074 to row (and multi-instance?)
[08:00:57] why multiinstance?
[08:01:08] db1060 is vslow, which vslow will you place there?
[08:01:11] it is an api/main
[08:01:18] db1090
[08:01:21] :s2
[08:01:31] ah, ok
[08:01:41] then yeah, db1074 as sanitarium master + db1090 as vslow sounds good
[08:02:04] We could also use db1090 as sanitarium master
[08:02:31] next I was going to depool db1069 and load it into db1090:s7
[08:03:55] sounds good
[08:04:06] note db1054 has to go, too
[08:05:40] yeah
[08:06:12] let me do the things we know are ok to do (get db1069)
[08:06:26] and then we can scratch a full plan
[08:06:28] sure
[08:06:30] https://gerrit.wikimedia.org/r/#/c/429150/
[08:06:31] can you check it?
[08:06:43] 10DBA, 10Dumps-Generation: Some dump hosts are accessing main traffic servers - https://phabricator.wikimedia.org/T143870#4160278 (10ArielGlenn) As far as I knew, this was a wikidata-only issue, though I have periodically checked this task to see if there is any more information. The zhwiki task, so that we h...
[08:07:05] why pool it?
[08:07:11] it shouldn't be a mediawiki host
[08:07:25] Yeah, but it is replicating s1, and I use the config files to check the alter tables done and pending
[08:07:45] it shouldn't be known to mediawiki
[08:08:07] it will be a core host in the future, so there is no harm in adding it now
[08:08:21] I disagree
[08:08:39] what's the harm in adding it now?
[08:09:06] what is the harm on adding labsdb1009?
[08:09:19] labsdb1009 will never be a core host
[08:09:40] so will not db1116, unless reimaged
[08:09:57] db1116 will be as soon as we get the sanitarium definitive hosts
[08:10:09] but it will have to be reimaged
[08:10:15] yes
[08:10:31] why do you want to add it, it is not a mediawiki host?
[08:10:52] only core hosts get there
[08:10:56] ok, fine
[08:11:31] you should use dbhosts to control dbs pending to be reimaged
[08:11:53] ok, I have abandoned it. This is not that important to spend more time discussing about it
[08:20:48] the whole point of etcd is to make mediawiki config not the source of truth, but took it away from it
[08:21:05] we want less things on those files, not more
[08:39:16] I don't know what I am doing https://gerrit.wikimedia.org/r/#/c/429153/
[08:40:31] that looks good
[08:41:10] but db1069 is also the alternative master
[08:41:15] so not sure it is ok to do that
[08:42:00] when it is moved to x1, how should we leave s7?
[08:42:13] assuming I just add db1090:s7 there?
[08:43:03] We can have db1090 as sanitarium master and db1079 as candidate master
[08:43:13] it is ok to leave a candidate master down for a long time?
[08:43:26] should we convert some other host to statement beforehand?
[08:43:27] I would think so
[08:43:59] but 79 is row because sanitarium
[08:44:06] this is a headache
[08:44:27] then maybe db1086 should be the new candidate master
[08:44:42] it is in a different row and rack
[08:45:10] so, probably I should depool it first, move it to statement
[08:45:15] and then do the above commit
[08:45:37] yeah, let's do that
[09:59:10] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#4160437 (10Marostegui)
[09:59:40] 10DBA, 10Epic, 10Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3396309 (10Marostegui)
[10:04:28] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4160440 (10MarcoAurelio) Confirmed. Happening to me as well for all of the newly created wikis.
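The broken-new-wiki reports above boil down to the public views not answering on the analytics/web replicas. A minimal sketch of the check, once connected to one of those replicas; inhwiki is used as an illustrative wiki name:

```sql
-- The public views for a new wiki live in a <dbname>_p schema on the
-- cloud replicas; if the view/role step was skipped, the database (or
-- the grants on it) will simply not be visible.
SHOW DATABASES LIKE 'inhwiki\_p';
SELECT COUNT(*) AS pages FROM inhwiki_p.page;
```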
[10:30:42] manuel you don't connect?
[10:31:09] ah the meeting
[10:31:14] sorry, I forgot
[11:14:50] jynus: I am done with db1090:3312 you can repool it whenever you like
[11:15:35] thanks
[11:15:43] I will take care of all of that
[11:15:50] let me know which one I can do next (no rush)
[11:15:53] I am going for lunch!
[11:16:02] I have yet to work with db1090:3317
[11:17:15] Sure, no problem. I won't touch any host in s2 unless you give me green light :)
[11:17:26] I will reimage as stretch db1086 actually
[11:17:37] all other hosts gave no errors
[11:18:31] maybe I am over-thinking, but I would do a quick compare.py run on db1060 and db1066
[11:35:03] jynus / marostegui https://phabricator.wikimedia.org/T184375#4160200 that you can fix?
[12:22:59] That's for Clouds team probably to fix
[12:23:16] Let me check the roles and all that
[12:35:33] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112#4161031 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table in...
[12:36:07] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774#4161034 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading ta...
[12:36:51] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375#4161039 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table in...
[12:37:23] 10DBA, 10Cloud-Services, 10User-Urbanecm, 10cloud-services-team (Kanban): Prepare storage layer for lfnwiki - https://phabricator.wikimedia.org/T183566#4161043 (10Marostegui) 05Open>03Resolved This is fixed: ``` marostegui@tools-bastion-03:~$ sql --cluster analytics gorwiki_p Reading table information...
[12:38:42] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): `pr_index` to be replicated to Labs public databases - https://phabricator.wikimedia.org/T113842#4161058 (10Marostegui)
[12:45:19] jynus: let me know if I can start with db1103 or db1105 (rc slaves in s2)
[12:46:06] oh, the rest of s2 can be done without telling me
[12:46:14] just don't do db1060
[12:46:21] ah cool!
[12:46:23] because it will be decommissioned
[12:46:25] I will start with rc slaves then :)
[12:46:26] thanks
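compare.py is the in-house consistency checker mentioned above; without relying on its exact interface, the same kind of spot-check between db1060 and db1066 can be done by running an identical bounded checksum on both hosts and diffing the output. Table, columns and id range below are purely illustrative:

```sql
-- Run the same query on both hosts over the same primary-key range;
-- differing row counts or checksums point at drifted data.
SELECT COUNT(*) AS rows_in_range,
       BIT_XOR(CRC32(CONCAT_WS('#', rc_id, rc_timestamp, rc_namespace, rc_title))) AS chunk_checksum
  FROM recentchanges
 WHERE rc_id BETWEEN 100000 AND 200000;
```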
[13:01:34] FYI I've reset the Force PXE from all hosts where it was set, including a bunch of DBs, and es1019 has *again* IPMI issues: T193155
[13:01:34] T193155: IPMI Audit 2018-04 - https://phabricator.wikimedia.org/T193155
[13:02:44] again?
[13:04:01] according to T187530, T155691 and T167121
[13:04:02] T187530: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530
[13:04:02] T155691: es1019.eqiad.wmnet drac unresponsive - https://phabricator.wikimedia.org/T155691
[13:04:02] T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121
[13:04:06] volans: we have a failsafe, BTW for the databases
[13:04:17] no, I know it had those in the past
[13:04:30] I was saying again as in, why an extra time?
[13:05:00] default partman recipe fails so we require a puppet change to reimage a db host
[13:05:03] again compared to the previous ones, is not normal that the IPMI breaks so easily and often
[13:05:36] we did that after we lost, I think es1019 accidentally
[13:06:17] yeah I know
[13:06:43] and probably not a bad policy for stateful services
[13:18:19] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4161161 (10Urbanecm) 05Resolved>03Open It do not work correctly. ``` urbanecm@tools-bastion-03 ~ $ sql hiwiki...
[13:31:13] ^ I will get that one fixed too
[13:31:50] role?
[13:32:41] yep
[13:33:08] we should tell cloud to document it after creation
[13:33:21] probably if it is done before, it fails because database doesn't exist
[13:33:25] 10DBA, 10Cloud-Services, 10Hindi-Sites, 10User-Jayprakash12345, 10cloud-services-team (Kanban): Prepare and check storage layer for hiwikimedia - https://phabricator.wikimedia.org/T188490#4161182 (10Marostegui) 05Open>03Resolved Should be fixed now ``` marostegui@tools-bastion-03:~$ sql --cluster we...
[13:33:29] yep, I was going to ping brooke, as she was the one taking care of those tasks
[13:34:20] I did STOP SLAVE; SET GLOBAL binlog_format=STATEMENT; FLUSH BINARY LOGS; FLUSH RELAY LOGS; START SLAVE;
[13:34:25] on db1086
[13:34:41] and worked?
[13:34:54] it is difficult to see if it was on mixed
[13:36:26] not sure what to check to verify it other than "there is no row based records on the binlog"
[13:37:13] I was mistaken in our previous meeting, switch for row D wasn't migrated yet
[13:37:16] It was, but on codfw
[13:37:19] lol
[13:37:24] that's what confused me :)
[13:37:46] let's stick to the plan of upgrading first C, as it is easier
[13:37:50] yes
[13:38:28] all events on older logs seem to be on row, none on the new one
[13:38:33] I think that should be enough
[13:38:48] for "it doesn't apparently break"
[13:38:59] in theory it should be fine, but we all know the theory..
[13:39:06] I guess if it was writing actively
[13:39:11] ongoing sessions
[13:39:16] it would take more time
[13:39:24] e.g. an active master
[13:39:45] yeah, that'd be a bit more scary to do
[13:39:47] it is not scheduled to be a master, so
[13:39:57] we can trust it until something is strange
[13:40:34] I think testing it is better than doing things irrationally
[13:40:47] as long as it is a safe test
[13:41:08] I was going to reimage it, but then realized it is a candidate master, so it will be the last set of servers to reimage
[13:41:17] so repooling it
[13:46:15] oki
[13:48:43] this is going to be a 5-ball game
[13:49:12] db1122 -> db1090 -> db1089 -> db1069 -> db1056
[13:49:25] db1089?
[13:49:30] you mean db1086?
[13:49:38] yeah, I am confusing it all the time
[13:49:48] and in the end I will create an outage, you'll see
[13:50:01] haha
[13:50:11] I know db1089 because I always use it first in s1
[13:50:13] XD
[13:50:14] I am doing commits like https://gerrit.wikimedia.org/r/#/c/429182/
[13:50:24] so I may go away today soon
[13:50:34] hahaha
[13:50:46] get some more cocacola!
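One way to make the "no row based records on the binlog" verification above concrete: after the FLUSH BINARY LOGS on db1086, list the events in the newest binlog and look for row-image event types. A minimal sketch; the binlog file name is a placeholder:

```sql
-- List the binlogs to find the one created by FLUSH BINARY LOGS.
SHOW BINARY LOGS;
-- With binlog_format=STATEMENT in effect for new sessions, the new log
-- should contain no Write_rows / Update_rows / Delete_rows events,
-- only Query events carrying the original statements.
SHOW BINLOG EVENTS IN 'db1086-bin.002345' LIMIT 200;  -- placeholder file name
```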
[13:56:46] since atop was disabled, very, very few connection errors on databases
[13:57:52] \o/
[16:45:47] 10DBA, 10Cloud-Services, 10User-Urbanecm: Prepare and check storage layer for idwikimedia - https://phabricator.wikimedia.org/T193187#4161962 (10Urbanecm)
[16:49:15] 10DBA, 10Patch-For-Review: Productionize 8 eqiad hosts - https://phabricator.wikimedia.org/T192979#4161987 (10jcrespo) Last update about productionization: * db1090 should be able to be repooled soon and so remove db1060 (decom) and db1069 (x1) from production, but I left it being compressed on 2 local screen...
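The compression jcrespo mentions leaving running in screen sessions is, in broad strokes, per-table InnoDB compression. A minimal sketch of the statement involved, assuming innodb_file_per_table and the Barracuda file format; the table name and block size are illustrative, the real runs iterate over the large tables of each section:

```sql
-- Rebuilds the table with compressed 8 KiB pages; on tables the size
-- of revision or logging this runs for hours, hence the screen sessions.
ALTER TABLE revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```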