[06:37:40] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (10Marostegui)
[06:37:48] 10DBA, 10Schema-change, 10Tracking: [DO NOT USE] Schema changes for Wikimedia wikis (tracking) [superseded by #Blocked-on-schema-change] - https://phabricator.wikimedia.org/T51188 (10Marostegui)
[06:37:51] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change: Dropping site_stats.ss_total_views on wmf databases - https://phabricator.wikimedia.org/T86339 (10Marostegui) 05Open>03Resolved This is all done
[06:41:37] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 (10Marostegui) a:03Marostegui
[06:47:27] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 (10Marostegui) s6 progress: [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore2001 [] dbstore1002 [] dbstore1001 [] db2095 [] db2089 [] db2087 [] db2076 [] db2067 [] d...
[06:47:44] 10Blocked-on-schema-change, 10DBA, 10Schema-change: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 (10Marostegui)
[07:57:47] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Scoring-platform-team, 10User-Ladsgroup: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 (10Marostegui) a:03Marostegui I have taken a look at the status of the `tmp_1 or tmp_2 or tmp_3` status. s1 : Has `KEY...
[08:10:58] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Scoring-platform-team, 10User-Ladsgroup: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 (10Marostegui)
[08:18:23] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Scoring-platform-team, and 2 others: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 (10Marostegui) I have deployed this index on db1096:3316 (s6) and I will leave it like that for some time to see if there is any...
[08:19:15] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Scoring-platform-team, and 2 others: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 (10Marostegui) s6 eqiad progress [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore2001 [] dbstore1002 [] dbstore1001 [] db209...
[08:19:41] 10Blocked-on-schema-change, 10DBA, 10MediaWiki-Database, 10Scoring-platform-team, and 2 others: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 (10Marostegui)
[08:59:41] marostegui: I'd like to proceed with dropping user_options on the s6 master, do you have any objections?
[09:00:01] I don't, as long as you have done your research :)
[09:00:09] Also, how are you going to monitor that it is not causing issues?
[09:07:42] I've read the Percona article about metadata locking, and the MariaDB article too, and I've re-read the Online DDL overview article on the MariaDB site, so I am now pretty confident it shouldn't cause any trouble for us, at least on s6. My plan is to do this with `--db` instead of `--dblist` (one-by-one)
[09:08:04] sounds reasonable
[09:08:06] For monitoring, my plan is to keep logstash open and see if any error shows up while doing the alter
[09:08:24] that might take some time, maybe you need to do some real-time monitoring?
[09:11:02] maybe, but I don't have any idea how to do real-time monitoring other than digging through the logs. Could you give me a hint?
[09:11:20] so, the first thing is…what would you be looking for?
[09:11:29] like, what do you want to look for?
[09:11:45] or rather, what can go wrong with the alter?
[09:14:00] What can go wrong: the DML queries could fail during the DDL
[09:14:22] right, so in which way can they fail?
[09:15:28] The application could get 'lock wait exceeded ...' errors during the DDL
[09:15:32] there you go
[09:15:41] so how can you see if that is happening in real time?
[09:16:39] I'd still say logstash, but I think you have a better idea?
[09:17:01] but you won't see queries piling up on logstash, no?
[09:17:18] oh
[09:17:21] where can you see those?
[09:17:25] 'show processlist';
[09:18:07] better to use `select * from processlist;` as that is non-blocking
[09:19:31] TIL.
[09:19:40] in case you see them piling up, what's the plan?
[09:20:38] `KILL` the alter
[09:21:02] I don't think that will fix the issue, it can make it even worse
[09:21:43] then I can decrease the lock_wait_timeout, and let mysql handle the piled-up connections?
[09:23:35] probably, you might also want to start with the smallest table of those 3 databases, so the alter runs faster and if there is an outage, it lasts less time
[09:25:36] 👍
[09:28:29] so, as a recap: I'll run the alter database by database instead of using the dblist, I'll start with jawiki (as it is the smallest table) and I'll keep checking the output of `SELECT * FROM PROCESSLIST` on the server itself. If connections are piling up during the alter, I'll decrease the `lock_wait_timeout` variable and let mysql kill those connections
[09:28:33] sounds like a plan?
[09:29:27] if connections pile up, it means you are locking the table, so lag will happen on the slaves, and you'll need to downtime those to avoid massive paging if the alter lasts long
[09:29:33] how long does it take for jawiki, more or less?
[09:30:12] 20-30 seconds
[09:30:23] Good
[09:30:32] You'll be able to evaluate if there is any incident
[09:31:00] I think so
[09:31:29] Go ahead then, make sure you are not replicating the alter
[09:32:19] ok, I'll downtime all s6 hosts then with icinga-downtime from einsteinium
[09:32:23] then start the operation
[09:32:28] there is no need to downtime them
[09:32:45] banyek: einsteinium is no longer the active Icinga server
[09:32:46] 30 seconds will not be enough to cause a replication page on them
[09:32:52] thanks volans!
[09:32:55] downtimes there have no effect, use icinga1001
[09:33:16] or ssh icinga.w.o if using the script to populate known hosts with the dns repo
[09:33:16] banyek: I normally use icinga.wikimedia.org so it gets me to the active one
[09:33:27] tx, noted
[09:41:58] [OT] but I've opened T210380 ;)
[09:41:59] T210380: Icinga downtime script should fail on the passive hosts - https://phabricator.wikimedia.org/T210380
[09:42:22] good idea!
[09:45:16] ```./wmfmariadbpy/wmfmariadbpy/osc_host.py --method=ddl --host db1061.eqiad.wmnet --db jawiki --table user --no-replicate --debug "DROP COLUMN IF EXISTS user_options"```
[09:45:50] looks good
[09:46:04] then brace yourselves, I'm starting it
[09:46:09] * marostegui hides
[09:46:39] seems good so far
[09:46:43] :-)
[09:47:06] it went through, right?
[09:47:20] I had one '`Waiting for table metadata lock SELECT`' but it's gone
[09:47:36] (I was running `while true; do mysql --skip-ssl information_schema -e "select * from processlist where db = 'jawiki';" -BN ; echo ;sleep 1; done`)
[09:47:49] did the alter finish then?
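Editor's note: the check-and-fallback plan discussed above (poll the non-blocking `information_schema.processlist`, and lower `lock_wait_timeout` if sessions start queueing behind the metadata lock) could look roughly like the sketch below. This is only an illustration, not the procedure that was actually run: the database name, the threshold of 20 and the 5-second timeout are made-up values, and `--skip-ssl` simply mirrors the one-liner pasted at 09:47:36.

```bash
#!/bin/bash
# Illustrative only: watch for DML queueing behind the ALTER's metadata lock.
DB="jawiki"        # database being altered (assumption)
THRESHOLD=20       # arbitrary number of waiting sessions before reacting
while true; do
    waiting=$(mysql --skip-ssl -BN -e \
        "SELECT COUNT(*) FROM information_schema.processlist
         WHERE db = '${DB}'
           AND state LIKE 'Waiting for table metadata lock%';")
    echo "$(date +%T) sessions waiting on metadata lock: ${waiting}"
    if [ "${waiting}" -gt "${THRESHOLD}" ]; then
        # Fallback discussed above: make waiting statements give up quickly
        # instead of stacking up. Note: the GLOBAL value only applies to
        # connections opened after the change.
        mysql --skip-ssl -e "SET GLOBAL lock_wait_timeout = 5;"
    fi
    sleep 1
done
```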
[09:47:52] yes
[09:47:54] 34 seconds
[09:48:08] congrats - you've altered your first active master!
[09:48:13] yay
[09:48:21] what's the next db?
[09:48:27] in size I mean
[09:48:48] ruwiki
[09:48:52] 1.1 G
[09:49:01] and frwiki after, 1.4 G
[09:49:19] how long do those take?
[09:49:40] all four take about 2 minutes altogether
[09:49:47] 30 seconds for jawiki
[09:49:55] about 40 for ruwiki
[09:50:00] the rest is for frwiki
[09:50:04] good
[09:50:12] *all three
[09:51:36] I'll proceed to ruwiki then
[09:52:12] yep
[09:54:41] ruwiki is done
[09:54:58] great
[09:56:17] and I'll proceed to frwiki
[09:56:57] wow, after the drop the tables shrank to about half their size
[09:57:16] 546 -> 316, 1.1G -> 596M
[09:57:17] it rebuilt the table
[09:57:21] yes
[09:57:26] every byte is sacred
[09:57:31] XDD
[09:58:06] I have cleaned up my part from last week on the etherpad and added the new ones
[09:58:42] good, I started with mine as well, but I still have to work on it
[09:58:59] :)
[09:59:23] (that's the reason I am pushing s6 now, I want to write down 's6 is done'
[09:59:25] )
[09:59:32] hehe
[09:59:35] alterin frwiki
[09:59:46] g
[10:02:37] and frwiki is done too
[10:02:45] s6 ✅
[10:04:25] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Banyek)
[10:09:12] great!
[10:43:53] 10DBA, 10Patch-For-Review, 10User-Banyek: Productionize dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T202367 (10Banyek)
[10:44:35] I got distracted by clinic duty, rushing to the meeting, I may be 1 or 2 minutes late
[11:02:09] 10DBA, 10MediaWiki-Database, 10TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (10D3r1ck01)
[11:03:25] 10DBA, 10MediaWiki-Database, 10TechCom-RFC: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table - https://phabricator.wikimedia.org/T209773 (10D3r1ck01)
[13:15:13] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review, 10Schema-change, 10User-Banyek: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 (10Banyek)
[13:21:37] 10DBA, 10Analytics, 10Analytics-Kanban, 10Core Platform Team, and 2 others: Not able to scoop comment table in labs for mediawiki reconstruction process - https://phabricator.wikimedia.org/T209031 (10JAllemandou)
[14:01:33] I am preparing the patches for the s2 schema changes, and db1095 has no trace in db-eqiad.php
[14:02:05] Look for it on site.pp ;)
[14:02:51] <3
[14:02:53] tx
[14:05:02] banyek: that is the reason why the instances table exists, to centralize that
[14:51:26] actually, besides the commit msg (and comments), I'm becoming more and more fluent in weight adjustments. ;) Hopefully on the next iteration (s5) all will be good
[14:51:47] the more you practice the better you will get ;)
[14:51:57] I'll put these on hold until tomorrow, and start with the schema change on s2
[14:52:05] now I'll update the etherpad
[14:52:13] and talk to brooke about the labsdb hosts
[14:52:20] great
[14:52:21] and then we'll have the analytics meeting
[14:52:29] any pages from last week?
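Editor's note on the size drop observed at 09:56-09:57 (the `user` tables roughly halving once the column drop rebuilt them): the figures quoted in the chat came from the DBAs' own checks, but one generic way to get comparable before/after numbers from the server itself is sketched below; the wiki list and table name are taken from the conversation above and the whole snippet is only an illustration.

```bash
#!/bin/bash
# Illustrative only: report the on-disk footprint of the user table per wiki,
# as tracked by information_schema (data + indexes, in MB).
for db in jawiki ruwiki frwiki; do
    mysql --skip-ssl -e "
        SELECT table_schema, table_name,
               ROUND((data_length + index_length) / 1024 / 1024) AS total_mb
        FROM information_schema.TABLES
        WHERE table_schema = '${db}' AND table_name = 'user';"
done
```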
[14:52:47] I have to check those, because I don't remember
[14:52:53] yeah, me neither
[14:52:55] I don't think so
[14:53:06] but double-check while updating the etherpad
[14:53:14] 👍
[14:53:20] thank you
[15:50:29] replication is broken on db1122
[15:50:55] banyek: read -operations, that paged 20 minutes ago (and db1095)
[15:51:54] I don't know why I missed it
[15:51:55] :(
[16:01:31] 10DBA, 10Operations, 10monitoring, 10Patch-For-Review: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473 (10Dzahn)
[16:29:58] banyek: it should've arrived on your phone
[16:30:25] That's why I noticed it, because I saw the LED flashing
[16:30:33] so that is the plan, rebuild the hosts?
[16:30:42] probably I was in the bathroom when it happened :(
[16:30:57] jynus: yeah, I am halfway into rebuilding db1122
[16:31:03] oh, that is so quick
[16:31:05] Didn't want to leave s2 without that one for the night
[16:31:13] sorry to ask you to take care of it, it was either you or me
[16:31:31] db1095 is not critical - until tomorrow :-)
[16:31:38] jynus: Also, it was a race condition, it looks like that host was added to s2 in the middle of the schema change
[16:31:45] so it was missed as it wasn't fully part of s2 :)
[16:31:46] oh, how?
[16:31:55] copied
[16:31:57] ?
[16:32:00] Yeah
[16:32:02] cloned from another host?
[16:32:09] we then need to review the procedures
[16:32:20] Probably cloned from a host that didn't have the schema change yet, and then forgotten as it wasn't in the lists yet
[16:32:39] banyek: that is why we do one section at a time
[16:32:47] Or the list was generated before the host was in s2
[16:32:50] and don't start with the next until it is fully done
[16:32:52] Which is probably what happened
[16:33:00] is it in the current list?
[16:33:11] maybe it didn't get added
[16:33:32] yeah
[16:33:43] So this is what I think happened:
[16:33:47] s2 starts the schema change
[16:33:52] and I generate a list of hosts to alter
[16:33:58] then in the middle of it, db1122 gets added
[16:34:11] but I didn't update the list of hosts on the schema change phabricator task
[16:34:19] https://phabricator.wikimedia.org/T188299#4160112
[16:34:21] see?
[16:34:24] missing there
[16:34:56] we should check it on all hosts
[16:35:02] and add to the procedure
[16:35:12] to check on all hosts after all are done
[16:35:20] yeah, I'll do it now as it is easier with the "section" script
[16:35:31] but that was in april!
[16:35:34] wow, time flies
[16:35:57] so yeah, race condition
[16:36:11] and my bad for not checking
[16:36:15] we should review other changes around that time
[16:36:35] I did
[16:36:41] Because I normally do more than one at the same time
[16:36:52] I also have a one-liner to check those, if you need it
[16:36:54] And they were missing, that's why I decided to rebuild the host
[16:37:02] No, I checked already :)
[16:37:04] cool
[16:37:09] thanks!
[16:37:10] There were 3 schema changes done
[16:37:14] At the same time
[16:37:18] By looking at SAL :)
[16:37:29] banyek: ^ that's the reason we ask you to log every single thing :)
[16:37:38] seems legit
[16:37:47] (I mean no irony.)
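Editor's note: jynus mentions having a one-liner to verify that a schema change landed on every host; the actual command is not in the log. A hypothetical equivalent is sketched below, assuming a plain-text file with one `host port` pair per line for the section, and using schema/table/column names purely as placeholders.

```bash
#!/bin/bash
# Hypothetical check (the real one-liner is not shown in the log): report, for
# every host in a section, whether a given column still exists.
SCHEMA="jawiki"          # placeholder values - adjust per schema change
TABLE="user"
COLUMN="user_options"
while read -r host port; do
    present=$(mysql --skip-ssl -h "${host}" -P "${port:-3306}" -BN -e \
        "SELECT COUNT(*) FROM information_schema.COLUMNS
         WHERE table_schema = '${SCHEMA}'
           AND table_name  = '${TABLE}'
           AND column_name = '${COLUMN}';")
    echo "${host}:${port:-3306} ${SCHEMA}.${TABLE}.${COLUMN} present: ${present}"
done < section_hosts.txt     # assumed file: one "host port" pair per line
```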
[16:39:32] brb 10 min
[16:56:29] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson)
[16:58:29] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) a:05Cmjohnson>03RobH Rob, Can you complete the installs of pc1008-pc1010. The server used for pc1007 arrived DOA and a ticket with Dell needs to be submitt...
[17:03:30] 10DBA, 10Data-Services, 10Patch-For-Review, 10User-Banyek, 10cloud-services-team (Kanban): Upgrade/reboot labsdb* servers - https://phabricator.wikimedia.org/T209517 (10Banyek)
[17:06:46] 10DBA, 10Operations, 10decommission, 10ops-codfw, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (10RobH) p:05Normal>03High
[17:07:14] 10DBA, 10Operations, 10decommission, 10ops-codfw, 10Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 - https://phabricator.wikimedia.org/T209858 (10RobH) This is high priority due to return to Farnam in December. I'll get these ready for onsite wipe ASAP.
[17:08:44] 10DBA, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) p:05Normal>03High Setting up to high as we need to get the old ones out before the leasing deadline
[17:09:08] ^ we need to decide who will take care of those
[17:10:38] pc100*?
[17:11:01] yes
[17:11:16] I will do those
[17:57:16] I leave for today
[19:26:39] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10RobH)
[19:33:54] 10DBA, 10Operations, 10Patch-For-Review, 10User-Banyek: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 (10RobH) Please note that pc1008, pc1009, and pc1010 are ready for #dba team to take them over. OS is installed and run...
[19:34:20] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10RobH)
[19:55:11] 10DBA, 10Operations, 10ops-eqiad: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10RobH) a:05RobH>03Cmjohnson Assigning back to Chris for the followup to repair pc1007. The other servers have been handed off to #dba team for use via task T208383
[23:05:03] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (10Bstorm)
[23:37:10] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (10Bstorm) All set. Verified I can connect and select from cloud systems.
[23:37:18] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for liwikinews - https://phabricator.wikimedia.org/T205713 (10Bstorm) 05Open>03Resolved
[23:37:57] 10DBA, 10Cloud-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for yuewiktionary - https://phabricator.wikimedia.org/T205714 (10Bstorm) 05Open>03Resolved a:03Bstorm views, indexes, dns, etc. all set. Tested from toolforge.