[06:55:27] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: db2068 storage crash - https://phabricator.wikimedia.org/T180927#3780283 (10Marostegui) 05Open>03Resolved
[07:49:03] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3780295 (10Marostegui)
[07:50:36] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3780297 (10Marostegui)
[07:57:02] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3780300 (10Marostegui)
[09:34:51] If I can get a second pair of eyes on https://gerrit.wikimedia.org/r/#/c/392797/1 that would be appreciated :)
[09:53:03] it seems there was an issue with db1099.eqiad.wmnet this morning
[09:53:46] https://logstash.wikimedia.org/goto/fd90f1d49f52e635b6224fc2af6a6397
[09:54:28] or is it db1051.eqiad.wmnet ?
[09:55:45] db1051 was depooled in enwiki, for days
[09:56:03] since 16th nov
[09:56:40] look at the graph, errors starting at 9:20
[09:57:05] Unable to find pt-heartbeat row for 10.64.16.76
[09:57:42] That is weird, because db1051 has been copied from db1063
[09:57:48] And replication is working fine in that sense
[09:57:54] (and it is ROW based, as it is s5)
[09:57:58] db1051 is depooled btw
[09:58:12] something is happen
[09:58:16] *ing
[09:58:26] maybe the scap failed on some hosts?
[09:58:42] the errors are all from mw1191
[09:58:50] let's see
[09:58:54] but db1051 was never pooled eh
[09:59:23] I think that host should have very old code
[09:59:30] yes
[09:59:32] still thinking it is pooled on enwiki
[09:59:34] it is pooled on that host
[09:59:38] wow
[09:59:39] yeah
[09:59:46] and it was depooled the 16th nov!
[09:59:59] is it pooled now?
[10:00:12] on that host yes, but everywhere, no
[10:00:27] https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php
[10:00:37] that host even has db1063 as s5 master
[10:00:40] no, I mean
[10:00:46] is the mediawiki server pooled
[10:00:50] ah
[10:01:00] don't know
[10:07:29] Can you review https://gerrit.wikimedia.org/r/#/c/392797/ ?
[10:15:45] em
[10:16:08] Patch Set 1: Code-Review+1 10:37 (local time)
[10:18:31] oh weird
[10:18:34] never got that notification
[10:18:35] sorry
[10:24:29] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3780703 (10Marostegui)
[10:46:11] should I create s8 on codfw now?
[10:46:22] +1 for that!
[10:51:30] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-extensions-ORES, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and deploy schema change on dropping oresc_rev_predicted_model index - https://phabricator.wikimedia.org/T180045#3780746 (10Marostegui)
[11:26:21] 10DBA, 10Wikidata: Provision a separate DB shard for wbc_entity_usage - https://phabricator.wikimedia.org/T176277#3780841 (10jcrespo) a:05jcrespo>03None This needs hardware provisioning, and that means budget, and that means a detailed plan with our overall ok.
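The "Unable to find pt-heartbeat row" errors above come from MediaWiki failing to find, on the replica, the heartbeat row that pt-heartbeat writes on the master. A minimal sketch of that kind of check, assuming pymysql, the default heartbeat.heartbeat table layout, and placeholder credentials; this is illustrative, not the actual MediaWiki lag-detection code:

    import pymysql  # assumption: pymysql (or any MySQL client library) is available

    def check_heartbeat_row(replica_host, master_server_id):
        # pt-heartbeat on the master periodically updates a row keyed by its
        # server_id; replicas should see that row via replication. If it is
        # missing, MediaWiki logs "Unable to find pt-heartbeat row for <master>".
        conn = pymysql.connect(host=replica_host, user="watchdog",
                               password="placeholder", database="heartbeat")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT ts FROM heartbeat WHERE server_id = %s",
                            (master_server_id,))
                row = cur.fetchone()
            if row is None:
                print("no heartbeat row for server_id", master_server_id)
            else:
                print("last heartbeat written at", row[0])
        finally:
            conn.close()

In the incident above the row existed; the problem was a stale db-eqiad.php on mw1191 that still pointed at the old s5 master, so the lookup was made against the wrong master identity.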
[11:28:14] I am going to drop amwikimedia from s7
[11:28:22] I am going to do it server by server
[11:28:30] because there is a real wiki on s3
[11:28:39] so it would cause problems otherwise
[11:28:44] yeah, that issue that happened because of running the script twice or something
[11:28:54] will it break x1?
[11:28:57] x1?
[11:29:13] no, either there is a database there or not
[11:29:52] I am going to do it on purpose only on the single s7 hosts (not multisource), not on x1 or on s3 or on multisource
[11:30:09] cool!
[11:30:23] so we can close T176043
[11:30:23] T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) - https://phabricator.wikimedia.org/T176043
[11:30:56] while that is ongoing, I will prepare the s5/s8 split
[11:31:09] I am looking at patches, but they will probably need to be rebased manually
[11:31:25] I will not do any filter/deletion in any case
[11:31:34] just the logical and topology split
[11:31:49] which actually, should be super-reliable with ROW :-)
[11:31:52] haha
[11:31:59] indeed, we should be good now with ROW XD
[11:32:24] then I may ask amir for help
[11:32:29] to check things are ok
[11:33:29] we should do a master failover on codfw, too, right?
[11:33:38] so we get rid of the old hosts
[11:33:46] yeah, only for that reason
[11:33:50] db2023
[11:34:12] I think the masters and maybe misc are the only hosts left in production
[11:34:21] I think we talked about db2045 as replacement
[11:34:37] actually, some masters are still pooled as slaves
[11:34:38] yeah, only miscs and masters and db2016 (old master)
[11:34:46] yeah
[11:35:04] for that we can just reclone another host from it and throw it away
[11:35:35] eqiad still has a lot of old hosts
[11:35:47] lot?
[11:35:52] not really, 4-5 or less I think
[11:35:59] (not counting misc)
[11:36:13] ok
[11:36:15] 15 on tendril
[11:36:25] there is still 46 and s7 there
[11:36:26] db1044 and db1034 are ready to go away
[11:36:42] for db1044 we only need to move db1095 to db1072, and db1034 I will actually decom tomorrow
[11:37:07] we are blocked on some misc failovers, however
[11:37:17] which are blocked on reimages
[11:37:25] which are blocked on proxy reimages
[11:37:27] haha
[11:37:32] which are blocked on proxy failovers
[11:37:34] which is all blocked on time
[11:37:35] XD
[11:49:20] 10DBA, 10MediaWiki-Database, 10Patch-For-Review, 10PostgreSQL, 10Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#3780868 (10jcrespo) a:05jcrespo>03None Still a work in progress.
[11:49:59] 10DBA: Check recentchanges table and query errors on wikis other than commonswiki and ruwiki - https://phabricator.wikimedia.org/T178290#3780870 (10jcrespo) a:05jcrespo>03None
[12:02:58] I am keeping a copy of amwikimedia-s7 in my home on neodymium - it is mostly empty
[12:04:20] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3780899 (10Marostegui) db1101.s7 has been fully pooled. I will leave it running along with db1034 for 24h, then I will depool db1034 and leave it for a few days to make sure it all runs smoothly and we c...
[12:04:33] literally empty (only create statements)
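Dropping the wiki "server by server" instead of once on the master keeps the DROP out of the replication stream, so it never reaches s3, x1, or the multisource hosts, which is the restriction discussed above. A minimal sketch of that pattern, assuming pymysql and a placeholder host list and credentials:

    import pymysql  # assumption: pymysql is available

    # Hypothetical list: the single-instance s7 replicas only
    # (no multisource, no x1, no s3).
    S7_SINGLE_INSTANCE_HOSTS = ["db1033.eqiad.wmnet"]  # placeholder host name

    def drop_on_each_host(dbname="amwikimedia"):
        for host in S7_SINGLE_INSTANCE_HOSTS:
            conn = pymysql.connect(host=host, user="root", password="placeholder")
            try:
                with conn.cursor() as cur:
                    # Disable binary logging for this session so the DROP is
                    # local to this host and does not replicate anywhere.
                    cur.execute("SET SESSION sql_log_bin = 0")
                    # dbname is a trusted constant here, so plain
                    # concatenation into the DDL statement is acceptable.
                    cur.execute("DROP DATABASE IF EXISTS " + dbname)
            finally:
                conn.close()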
[12:05:43] the problem is I cannot be 100% sure we have deleted all instances
[12:06:07] what do you mean
[12:06:21] a server that is down, not on tendril
[12:06:29] not on the list of hosts
[12:06:37] or not replicating
[12:06:45] mmm, unlikely, no?
[12:06:50] yeah
[12:07:10] but it's not like it hasn't happened in the past
[12:13:47] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) - https://phabricator.wikimedia.org/T176043#3780911 (10jcrespo) a:05jcrespo>03None Dropped amwikimedia from all s7 hosts (that were not...
[12:14:11] 10DBA, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) - https://phabricator.wikimedia.org/T176043#3780914 (10jcrespo) p:05High>03Normal
[12:15:06] heads up on the topology changes I will be doing for s8 on codfw (with potential minor changes on eqiad)
[12:16:42] ok, we shouldn't be stepping on each other's toes at this point
[12:16:46] I am working with db1097
[12:16:49] so nothing to worry about
[12:16:52] cool
[12:17:04] I may delay them until after lunch
[12:17:34] enjoy lunch!
[12:25:45] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3780960 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` db1097.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2017112...
[12:43:40] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3781030 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1097.eqiad.wmnet'] ``` and were **ALL** successful.
[13:35:50] labsdb1004 replication broken
[13:35:57] trying to drop a non-existing view
[13:36:52] I have skipped it for now
[13:38:19] https://phabricator.wikimedia.org/P6364
[13:46:53] +1
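Skipping a broken replicated statement, as was just done for the failed DROP VIEW on labsdb1004, is normally a stop/skip/start dance with sql_slave_skip_counter. A sketch of that procedure, assuming pymysql, classic non-GTID replication (where the counter is valid), and placeholder credentials; skipping events always deserves a second look afterwards, since it can hide real drift:

    import pymysql  # assumption: pymysql is available

    def skip_one_replication_event(host):
        conn = pymysql.connect(host=host, user="root", password="placeholder")
        try:
            with conn.cursor() as cur:
                cur.execute("STOP SLAVE")
                # Skip exactly one event, e.g. a DROP VIEW for a view
                # that does not exist on this replica.
                cur.execute("SET GLOBAL sql_slave_skip_counter = 1")
                cur.execute("START SLAVE")
        finally:
            conn.close()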
[14:00:06] should I pool back db1068?
[14:00:32] sorry
[14:00:35] I mean db2068
[14:00:43] sure
[14:00:49] I forgot it was depooled, sorry
[14:04:22] noticed it on manual rebase
[14:05:08] can i nitpick? :)
[14:05:25] ah no
[14:05:26] nevermind
[14:05:27] :)
[14:05:43] actually, I was going to ask for a re-review
[14:05:51] so please go ahead
[14:05:52] doing it now :)
[14:08:00] I will do the topology changes meanwhile
[14:08:15] it looks good, going to +1 it
[14:09:04] technically I did no other changes before your last +1
[14:09:12] except db2068
[14:09:23] but maybe something got lost by other changes
[14:09:53] i was going to suggest removing db2038 and leaving the multi-instance one to "serve" everything, but that is not really needed now
[14:09:56] so not a biggie
[14:15:02] oh, I see
[14:15:09] we definitely will do that
[14:15:17] but I want to handle s8 first
[14:15:23] then I "fix" s5
[14:15:23] yeah, as I said, not a big deal
[14:15:53] worried about doing too much at the same time
[14:15:53] once the config is tested, we can slowly fix s8, as in 3315 -> 3318 and all that stuff
[14:16:03] exactly
[14:16:05] yeah, let's just try the topology+config
[14:16:11] the other thing…there is no rush
[14:16:20] that may require a full rolling restart
[14:16:32] well, not full, of all multi-instance ones
[14:16:34] you mean mysql, no?
[14:16:39] yes
[14:16:41] sorry
[14:16:45] yeah, we will need that
[14:16:52] actually, small changes to all
[14:16:53] but as I said, there's no rush for that
[14:17:00] for icinga names and that stuff
[14:17:24] I will do this first, then leave it for some time, test it, etc
[14:17:28] indeed
[14:21:49] we have to create the s8.hosts too btw
[14:21:56] again, not a big deal
[14:21:59] just a reminder
[14:21:59] lots of things :-)
[14:22:44] one thing I am not touching is the dblist until all, including eqiad, are in place
[14:23:03] yeah
[15:28:15] hey, Amir1, if you have time, regarding wikidata testing: do you have some urls that would be read-only, to check it works as intended?
[15:28:38] aside from trivial reading of pages, recentchanges, etc.
[15:29:00] jynus: sure
[15:29:13] I am open to any suggestion
[15:29:42] maybe we could even create some artificial load
[15:33:08] also, as a deployer, do you know if s5.dblist would be used somewhere in configuration (I can grep, but in case you know something else)?
[15:34:04] actually, I caught the first error
[15:35:04] jynus: I don't think s5 is used anywhere
[15:35:11] we use the dbname directly
[15:35:19] cool
[15:35:33] so the error I just realized: as s8 doesn't exist on the master yet
[15:35:41] queries error out
[15:35:46] regarding urls, I can think of lots of API urls
[15:36:01] put some on a paste or somewhere
[15:36:44] okay
[15:36:46] it doesn't have to be extensive, just something other than simple GETs
[15:36:53] of regular pages
[15:39:47] https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42%7CP17
[15:39:51] this is one thing
[15:40:20] These are docs, using the examples would be okay:
[15:40:24] https://www.wikidata.org/w/api.php?action=help&modules=wbsgetsuggestions
[15:40:24] https://www.wikidata.org/w/api.php?action=help&modules=wbavailablebadges
[15:40:24] https://www.wikidata.org/w/api.php?action=help&modules=wbcheckconstraintparameters
[15:40:24] https://www.wikidata.org/w/api.php?action=help&modules=wbcheckconstraints
[15:41:15] thanks
[15:45:32] that is very helpful
[15:47:03] The links query different tables
[15:48:16] so I think to test edits, we could do reads, do edits on the real production servers, and check those changes get propagated properly
[15:48:41] also, if I render a page with wikidata text for the first time, wikidata is queried, right?
[15:48:48] wikidata references, I mean
[15:49:01] interwikis, or template pieces
[15:49:22] or is it cached on the local database?
[15:49:57] I assume it has to be checked on render with no cache?
[15:50:21] it is ok to say "I don't know" or if you are busy
[16:23:21] 10DBA, 10Patch-For-Review: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359#3781687 (10Marostegui) db1097.s4 is now replicating
[16:30:27] so it went relatively well, no? :)
[16:31:45] yeah, I wanted to warn about the pt-heartbeat thing
[16:32:14] maybe now I would start depooling hosts from eqiad
[16:32:39] and do all those small things: restarts
[16:32:51] and config changes
[16:32:51] that would be good
[16:32:57] but don't touch db1101 :-)
[16:32:58] and labsdb
[16:33:02] that is my only requirement :)
[16:33:04] ok
[16:33:12] i am finishing compression there
[16:33:36] actually, I could touch it, if i do not reboot it
[16:33:47] for eqiad, I would not restart anything
[16:33:50] for now
[16:33:56] ah yeah, touch it, yes
[16:34:00] restart no :)
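The read-only API urls above can double as a small artificial-load smoke test. A sketch using only the Python standard library; the url list is copied from the conversation and the round count is an arbitrary placeholder:

    import urllib.request

    URLS = [
        "https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42%7CP17",
        "https://www.wikidata.org/w/api.php?action=help&modules=wbsgetsuggestions",
        "https://www.wikidata.org/w/api.php?action=help&modules=wbavailablebadges",
        "https://www.wikidata.org/w/api.php?action=help&modules=wbcheckconstraintparameters",
        "https://www.wikidata.org/w/api.php?action=help&modules=wbcheckconstraints",
    ]

    def smoke_test(rounds=5):
        # Fetch each url a few times and report anything that is not HTTP 200.
        for _ in range(rounds):
            for url in URLS:
                try:
                    with urllib.request.urlopen(url, timeout=10) as resp:
                        if resp.status != 200:
                            print("FAIL", resp.status, url)
                except OSError as exc:
                    print("ERROR", url, exc)

    if __name__ == "__main__":
        smoke_test()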
[16:35:23] so what hosts are or will be multisource?
[16:35:35] for s5 and s8?
[16:35:56] let me see
[16:36:10] db1097 ?
[16:36:12] so db1096 and db1099 will serve rc but not as multi-source
[16:36:22] and db1101 and db1097 will be multi-instance
[16:36:33] we can convert db1096 and db1099 later
[16:36:48] so i would suggest we pair them like 1 normal + 1 multi-instance
[16:37:07] for example s5: db1096 and db1101, and s8: db1099 and db1097
[16:37:22] and we can convert db1096 and db1099 later with no rush
[16:37:38] sure, let's make a list summarizing that
[16:38:04] I think I was going to do that when the outage happened
[16:38:26] db1101 is ready but compressing, and db1097 will be cloned from db1101 once it is ready, hopefully tomorrow
[16:38:31] and all the HW will be in place :)
[16:38:32] or is it? https://gerrit.wikimedia.org/r/#/c/391198/5/wmf-config/db-eqiad.php
[16:38:58] yeah, that is the start, but it is not finished
[16:39:06] i know
[16:39:17] literally dump your suggestions there as a comment
[16:39:17] yeah that was a good draft
[16:39:20] so they are not lost
[16:39:29] I will obviously not commit anything
[16:39:32] ok, let me do it
[16:39:47] but it is getting complex and easy to lose track
[16:39:50] 0:-)
[16:40:13] yeah, let me do it on a piece of paper here first XDD
[16:40:14] that is why I wanted to deploy the codfw change - now it is easy to see
[16:42:55] I will meanwhile give db1095 (sanitarium) a try with filters, but it is going to be complex due to heartbeat sharing
[16:43:28] not even sure it is worth changing beforehand
[16:44:32] don't know how to comment on that ticket
[16:44:37] or do it on an etherpad
[16:44:40] with my proposal
[16:44:52] let me do it in an etherpad
[16:45:00] it might be easier to modify and/or comment
[16:45:14] sure
[16:49:37] there you go
[16:49:41] what do you think
[17:00:03] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3781800 (10alanajjar)
[17:01:00] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3781817 (10Marostegui) if you want to do it now, go ahead.
[17:17:58] jynus: sorry, I was busy
[17:18:19] (afk for dinner)
[17:18:41] the rc records of wikidata will be injected into the client's database, but for other cases it does query the repo (wikidata) too
[17:18:57] ok, thanks
[19:07:55] 10DBA, 10Operations, 10Wikimedia-Site-requests: Global rename of JeanBono → Rexcornot: supervision needed - https://phabricator.wikimedia.org/T181170#3782418 (10alanajjar) @Marostegui sorry for being late! When you are here, ping me again (Y)
[22:11:37] 10DBA, 10Analytics, 10Operations, 10Patch-For-Review, 10User-Elukey: Decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3782789 (10Nuria) @Nettrom Right, it's not only that dashboard but all the ones that are fed data via reportupdater that needed the new configura...
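As a follow-up to the db1095 (sanitarium) filter work mentioned above: one quick way to see what a replica is actually filtering is to read the Replicate_* fields out of SHOW SLAVE STATUS. A sketch assuming pymysql and placeholder credentials; the port parameter matters on multi-instance hosts, where each shard's mysqld listens on its own port:

    import pymysql  # assumption: pymysql is available

    FILTER_FIELDS = (
        "Replicate_Do_DB", "Replicate_Ignore_DB",
        "Replicate_Do_Table", "Replicate_Ignore_Table",
        "Replicate_Wild_Do_Table", "Replicate_Wild_Ignore_Table",
    )

    def show_replication_filters(host, port=3306):
        conn = pymysql.connect(host=host, port=port, user="root",
                               password="placeholder",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone() or {}
            for field in FILTER_FIELDS:
                print("%s: %s" % (field, status.get(field, "")))
        finally:
            conn.close()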