[06:12:25] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3901965 (10Marostegui)
[07:01:40] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3901995 (10Marostegui) s8 codfw has been done. Let's track here s8 eqiad progress: [] dbstore1001 [] dbstore1002 [] labsdb1011 [] l...
[07:21:46] 10DBA, 10Patch-For-Review: Decommission db1030 - https://phabricator.wikimedia.org/T184397#3902002 (10Marostegui) a:03Marostegui
[07:56:48] ready when you are ready
[08:00:12] let's go
[08:01:32] let me warm up the servers
[08:01:37] sure
[08:09:39] I made the plan on the etherpad
[08:09:53] let me see
[08:10:40] sounds good
[08:11:24] merging puppet and mediawiki patches
[08:12:03] what do you want me to do?
[08:12:13] check kibana
[08:12:20] on my way!
[08:12:23] and general health
[08:12:28] let's go to ops
[08:12:30] cool!
[08:27:31] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, and 2 others: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3902059 (10Marostegui)
[08:29:12] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3902060 (10Marostegui) I have not seen any more delays on the wiki replicas since this was set up, so those thresholds are looking pretty good!
[08:36:02] do you mind leaving the etherpad with that until tomorrow, when I switch dbstore1002?
[08:36:29] I mean dbstore1001
[08:36:31] of course not!
[09:01:04] 10DBA, 10Data-Services: Re-institute query killer for the analytics WikiReplica - https://phabricator.wikimedia.org/T183983#3902093 (10jcrespo) I've seen at times 30 seconds, but I guess that is ok?
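The query killer referenced in T183983 above is, at its core, a loop that inspects the server's process list and kills user queries that exceed a time threshold (tools like pt-kill work this way). A minimal sketch of just the selection logic, using a plain list of dicts in place of live `SHOW PROCESSLIST` output; the threshold value and field names here are illustrative, not the actual production settings:

```python
# Sketch of a query-killer's victim selection. The process list below is a
# list of dicts standing in for SHOW PROCESSLIST rows; in production a tool
# such as pt-kill polls a live server and issues KILL for each victim id.

QUERY_TIME_THRESHOLD = 300  # seconds; illustrative, not the real setting


def queries_to_kill(processlist, threshold=QUERY_TIME_THRESHOLD):
    """Return the ids of user queries running longer than the threshold."""
    victims = []
    for proc in processlist:
        if proc["command"] != "Query":
            continue  # skip sleeping connections and replication threads
        if proc["time"] > threshold:
            victims.append(proc["id"])
    return victims


if __name__ == "__main__":
    simulated = [
        {"id": 1, "command": "Query", "time": 30},
        {"id": 2, "command": "Sleep", "time": 4000},
        {"id": 3, "command": "Query", "time": 1200},
    ]
    print(queries_to_kill(simulated))  # → [3]: only id 3 exceeds the threshold
```

A real deployment would also exclude system accounts and long-running maintenance users before killing anything.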
[09:05:23] 10DBA, 10Patch-For-Review: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3902099 (10jcrespo)
[09:05:35] 10DBA, 10Patch-For-Review: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3870892 (10jcrespo) a:03jcrespo
[09:06:19] 10DBA, 10Patch-For-Review: Decommission db1029 and db1031 - https://phabricator.wikimedia.org/T184054#3870892 (10jcrespo) We will wait some days before going on, to make sure the new masters work as intended/no errors/no data loss.
[09:31:40] now that sarin is booted into the new kernel, neodymium is next. I noticed a number of DBA jobs running on neodymium; do we have a time window this week to reboot it?
[09:32:14] moritzm: From my side, my long connections will be finished tomorrow morning (or even today late in the evening)
[09:32:17] I will move back to sarin :)
[09:35:46] ok, nice
[09:47:14] One thing: yesterday at 14:00 UTC we enabled a feature on cewiki, cawiki, elwiki, and trwiki that should make wbc_entity_usage a little bit bigger, and as a result the writes will increase as well, though not by much
[09:47:28] and it's needed for proper RC injection (and the jobqueue issue)
[09:47:57] I'm monitoring lag and nothing has popped up yet
[09:48:30] thanks for the heads up
[11:49:20] I am going to insert some fake pages on db1112:commonswiki (test server) to test some compare.py optimizations
[11:57:43] https://phabricator.wikimedia.org/P6591
[14:57:48] sorry for the late reply, I went for lunch!
[14:57:59] That is a great improvement, it will be really, really useful
[14:58:09] for the things I have in mind to put in place as "reports"
[14:58:23] it got better since midday
[14:59:01] marostegui: https://phabricator.wikimedia.org/P6591#37116
[14:59:29] it also works in parallel between servers; I will have a look at having more than one thread per server
[14:59:35] not sure I get the 100 iterations thing
[14:59:38] aaaaah
[14:59:44] maybe I can do it by time
[14:59:55] every X seconds, rather than every X queries
[15:00:16] then I got confused again, what is the 100 iterations?
[15:00:16] there is now a --print-every that defaults to 100
[15:00:48] 1 iteration == 1 chunk == #step# rows
[15:01:13] here step is 10000, so 10000 * 100 rows
[15:01:27] but probably time is a better option
[15:01:36] So, 1000 rows in chunks of 100
[15:01:51] "X seconds passed since start"?
[15:06:43] marostegui: https://phabricator.wikimedia.org/P6591#37117
[15:10:07] nice, every 10 seconds!
[15:11:00] There is one thing that would be super useful: an option to print the exact value (or the PK value) that is different
[15:12:45] I thought about it, and I may not do it, at least not soon
[15:13:19] the idea is that if you find a chunk that is different, you can run it with --step=1
[15:13:25] and that will give you all the rows
[15:13:29] that's true
[15:13:46] the problem is that it could be a lot of rows
[15:13:53] and you would see nothing
[15:14:23] yeah, but you can always redirect to a file or something
[15:14:29] and the steps should be small enough to reimport full chunks
[15:15:10] maybe we can do it if it is only a few rows inside a chunk
[15:16:21] also, what if there are some false positives?
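The workflow discussed above — compare in chunks of `--step` rows, report progress every `--print-every` chunks, then rerun a differing chunk with `--step=1` to locate the exact rows — could be sketched roughly as follows. This is not compare.py itself; two in-memory SQLite databases stand in for two MySQL servers, and the function and option names are illustrative:

```python
import hashlib
import sqlite3


def chunk_checksum(conn, table, pk, start, step):
    """Hash one chunk of `step` rows, ordered by primary key."""
    rows = conn.execute(
        f"SELECT * FROM {table} ORDER BY {pk} LIMIT ? OFFSET ?",
        (step, start),
    ).fetchall()
    return hashlib.md5(repr(rows).encode()).hexdigest()


def compare_tables(conn_a, conn_b, table, pk, total_rows,
                   step=10000, print_every=100):
    """Return offsets of differing chunks; print progress periodically."""
    diffs = []
    for i, start in enumerate(range(0, total_rows, step)):
        if chunk_checksum(conn_a, table, pk, start, step) != \
                chunk_checksum(conn_b, table, pk, start, step):
            diffs.append(start)
        if (i + 1) % print_every == 0:
            print(f"checked {i + 1} chunks")
    return diffs


if __name__ == "__main__":
    a, b = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for conn in (a, b):
        conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, title TEXT)")
        conn.executemany("INSERT INTO page VALUES (?, ?)",
                         [(i, f"Page_{i}") for i in range(1000)])
    b.execute("UPDATE page SET title = 'drifted' WHERE page_id = 250")
    # coarse pass: chunks of 100 rows; the chunk at offset 200 covers the drift
    print(compare_tables(a, b, "page", "page_id", 1000, step=100, print_every=5))
    # drill down with step=1, as suggested above, to pinpoint the row
    print(compare_tables(a, b, "page", "page_id", 1000, step=1, print_every=10**9))
```

Switching the progress report from "every N chunks" to "every X seconds", as proposed in the chat, would only require comparing `time.monotonic()` against the last report time instead of the chunk counter.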
[15:16:41] but that can also happen now; I always do two iterations
[15:17:08] then it would be nice if that was done automatically by the script
[15:17:25] all that is more "advanced", and I am not saying I will not do it
[15:17:32] just not now
[15:17:47] Yeah yeah, it was just a suggestion of something I have been missing while heavily using compare.py lately :)
[15:18:04] the idea of this is to get a 0 or 1 depending on whether they are the same or different
[15:18:38] send a bug for the wishlist :-)
[15:19:12] where to! :)
[15:19:19] phabricator
[15:19:40] :)
[15:19:41] I think the advanced stuff should be a separate script that checks a range
[15:19:55] and gets a proposed REPLACE to run on the slave
[15:20:08] with double checking that it is not a false positive, etc.
[15:23:58] that'd be a great win
[15:24:59] I would prefer to set up parallelism, so it can run faster on idle servers
[15:25:12] or even FILE to live comparison
[15:26:46] I wonder how I should implement parallelism: if I run on a table with 4 threads, should I execute on rows 1-1000, 1001-2000, etc.?
[15:27:24] or pre-partition the table in 4 slices?
[15:27:41] I would go for the first option
[15:27:52] yeah, so errors are shown in order?
[15:27:58] exactly
[15:28:02] and if a table is loaded at the end
[15:28:02] otherwise it can be confusing
[15:28:05] depends on whether you prefer to scan the table linearly or have quicker feedback on the state of the table in different positions, IMHO
[15:28:25] I think linear should get some benefit due to InnoDB cache handling
[15:28:43] due to table and disk locality
[15:28:56] but that is only a theory
[15:31:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3902934 (10Papaul) a:05Papaul>03Marostegui @Marostegui Disk replacement complete.
[15:33:10] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3902940 (10Marostegui) Thanks!
```
logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)
physicaldrive 1I:1:6 (port 1I:box 1:bay 6, SAS, 600 GB, Rebuilding)
```
[19:26:56] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2036 - https://phabricator.wikimedia.org/T184836#3903753 (10Marostegui) 05Open>03Resolved All good! Thanks a lot Papaul!
```
root@db2036:~# hpssacli controller all show config

Smart Array P420i in Slot 0 (Embedded) (sn: 001438031205FF0)
Port...
```
[19:28:01] https://phabricator.wikimedia.org/P6591#37144
[19:35:37] That new feature just made my day
[19:35:39] Seriously
[19:35:48] Comparing more than one server!!!! <3
[19:36:21] That greatly simplifies the ideas I had for the automatic reporting :)
[19:39:52] 10DBA, 10Operations, 10Release-Engineering-Team, 10cloud-services-team: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805#3903854 (10Ottomata) p:05Triage>03Normal
[19:40:02] 10DBA, 10Operations, 10Patch-For-Review, 10Puppet: Move mariadb_maintenance away from terbium/wasat (mediawiki_maintenance) - https://phabricator.wikimedia.org/T184797#3903855 (10Ottomata) p:05Triage>03Normal
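The two ideas running through the conversation — the preferred parallelism design (consecutive ranges handed to worker threads so differences come back in table order) and comparing one source against more than one server with a simple same/different verdict — could be sketched together. Plain lists stand in for a table's rows on each host; the host names and function names are purely illustrative, not compare.py's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor


def differing_chunks(source, replica, step=1000, threads=4):
    """Compare consecutive row ranges of two 'servers' in parallel.

    executor.map hands out ranges 0..step, step..2*step, ... to the pool
    and yields results in submission order, so differences are reported in
    table order (the first option discussed: linear scan, ordered output).
    """
    def check(start):
        same = source[start:start + step] == replica[start:start + step]
        return None if same else start

    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(check, range(0, len(source), step))
    return [s for s in results if s is not None]


def compare_many(source, replicas, step=1000):
    """One source vs. several replicas; an empty list means 'in sync'
    (the 0-or-1 idea: bool(chunks) gives the same/different verdict)."""
    return {host: differing_chunks(source, rows, step)
            for host, rows in replicas.items()}


if __name__ == "__main__":
    source = list(range(10000))
    drifted = list(range(10000))
    drifted[2500] = -1  # simulated drift on one replica
    report = compare_many(source, {"db1100": list(range(10000)),
                                   "db1101": drifted})
    print(report)  # → {'db1100': [], 'db1101': [2000]}
```

The alternative design (pre-partitioning the table into one slice per thread) would give earlier feedback from different regions of the table, at the cost of out-of-order results and, per the theory in the chat, worse InnoDB cache and disk locality.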