[07:39:49] DBA, Phabricator, Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2817113 (mmodell) So from upstream, it seems that phabricator is just behaving the same way as other calendar apps and the language about "future events" is just potentially...
[07:40:43] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817114 (Marostegui) As we spoke yesterday. I am using db1052 (which was depooled yesterday) to import S1's tablespace to db109...
[07:51:24] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2817119 (Marostegui) Thanks - doing it now.
[07:55:23] DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817124 (Marostegui) Given that m3 isn't big (100G) I can either reimport that table or the whole tablespace from db2012
[09:03:02] DBA, Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2817177 (Volans) p:Triage>Normal
[09:45:20] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2817207 (Marostegui) Hey @Papaul The server crashed again - same symptoms: First attempt: ``` iLO Advanced 2.40 at Dec 02 2015 Server Name: WIN-12 Server Power: Off ``` ``` hp...
[09:49:39] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817211 (jcrespo) dbstore1001 doesn't use GTID, and it is a delayed slave that starts replication automatically, so it is not sim...
[09:51:04] ^ jynus then we probably need to use another slave to test row...
[09:51:31] why?
[09:52:38] that or we migrate dbstore1001 to the master, or you are confident that if we change the binlog format to row dbstore1001 won't break?
[09:52:46] DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817218 (jcrespo) We should stop the slaves in sync and import at least the tables on phabricator_conduit.
[09:53:22] why do you think dbstore1001 will break in row based replication?
[09:54:01] I don't know if it will break, but if the data isn't the same it might, that is why I am asking you :)
[09:54:29] DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817219 (Marostegui) That also works. We can stop db1048's replication for a few seconds let the slaves reach the same position, stop them, a...
[09:54:32] if data isn't the same we would want to know, given that it is our backup system
[09:55:07] but we have to move it at some point to the final master, don't you think?
[09:55:14] of course
[10:05:00] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817243 (hoo) >>! In T151356#2815197, @jcrespo wrote: > This is the goal, but we will try to achieve this without the force index, dependin...
[10:15:28] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817271 (jcrespo) The problem is not how to do the force, that can be added directly, the problem is that force index is a poor workaround,...
[10:18:24] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817274 (jcrespo) a:jcrespo I will move the remaining dbstore1001 replication channels to the right master, hopefully not brea...
[10:18:40] DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817276 (Marostegui) a:Marostegui
[10:23:58] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817294 (hoo) p:Triage>High
[10:31:08] jynus: is it fine to kill pt-heartbeat on db1048 for a few seconds so it doesn't write to the binlog and I can stop the slaves at the same position?
[10:31:39] go to the screen and see if it has logged any query
[10:32:28] which screen?
[10:32:48] how many screens does db1048 have?
[10:32:56] 0
[10:33:10] root@db1048:~# screen -ls
[10:33:11] No Sockets found in /var/run/screen/S-root.
[10:33:11] root@db1048:~#
[10:33:35] pt-heartbeat?
[10:33:50] it should not be running there
[10:33:56] it should be running on the master
[10:34:04] check the puppetization
[10:34:09] oki
[10:34:15] maybe that was an old master
[10:34:16] I will check
[10:38:44] it is marked as master whereas db1043 isn't
[10:38:53] according to tendril db1043 is the master of m3
[10:42:35] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817362 (daniel) So we have the choice of forcing a bad query plan ourselves, or leaving it to Maria to pick a bad query plan...
[10:45:58] db1043 isn't running pt-heartbeat but I do believe it is the master as the processlist shows connections from other hosts
[10:47:07] the dbproxy configuration shows the right master (db1043)
[10:50:33] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817404 (jcrespo) > So we have the choice of forcing a bad query plan ourselves, or leaving it to Maria to pick a bad query plan... No, wh...
[11:00:24] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817425 (ArielGlenn)
[11:51:09] DBA, Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817496 (Marostegui) The table `phabricator_conduit.conduit_methodcalllog` has been reimported from db2012 to dbstore10...
[11:55:01] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817499 (jcrespo) @ArielGlenn don't add yourself to this ticket, as it will be closed soon. Chec...
[12:10:46] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2817556 (ArielGlenn)
[12:33:46] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817590 (jcrespo) a:jcrespo>None db1052 and the others should be clear to be used.
[13:10:29] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817656 (Marostegui) All done, db1052 is now master of db1095 which is replicating ROW-based rep...
[13:26:58] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817685 (jcrespo) __wmf_checksums can be dropped at any time, it is the table I use for running...
[14:13:11] DBA, Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817785 (jcrespo) Open>Resolved Let's resolve, reopen if something else happens. dbstore1002 is not at all a pr...
[14:16:21] DBA, Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817789 (Marostegui) Sounds good. I will leave the table there and add a note in my calendar to drop it in a couple of...
[14:23:25] DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2817801 (jcrespo) p:Triage>Low This doesn't really look like disk issues. Anyway, the initial problem (lag) is gone.
[14:26:52] as db1092 is depooled, I will use it to experiment for https://phabricator.wikimedia.org/T151356
[14:27:50] Sounds good
[14:27:57] Let's see if it crashes again with some activity
[14:28:25] yes
[14:28:35] I will upgrade it now that it is depooled
[14:28:46] and reboot it once for the new kernel
[14:28:57] and play (safely) with it
[14:29:07] I assume you are not doing anything with it
[14:29:18] so this was just a heads up
[14:29:18] nop
[14:29:29] it is all yours
[14:29:42] I will pool it back on Monday most likely if it goes fine for the next few days
[14:29:54] I may be too annoying
[14:30:01] why?
[14:30:08] but 2 dbas on the same machine is a recipe for disaster
[14:30:13] ah no
[14:30:27] It is better to ask before destroying others' work!
[14:30:27] :)
[14:30:28] so I give a heads up when that
[14:30:31] exactly
[14:30:39] consider however irc as asynchronous
[14:30:45] unless I ping you
[14:31:14] ok :)
[14:31:17] still working on that :p
[14:37:32] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817856 (Marostegui) >>! In T150960#2817685, @jcrespo wrote: > __wmf_checksums can be dropped at...
[14:53:01] DBA, Operations, ops-codfw, Patch-For-Review: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2817901 (Marostegui) This has been running fine for 6 days already. Once the deploys are unblocked, I will pool it back.
[15:17:09] I am enabling histogram creation on db1092 as a test for a very common query optimization problem we suffer
[15:19:09] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817971 (jcrespo) If we have to fall back to index hinting, this should be preferred -ignore rather than force: ``` MariaDB [wikidatawiki]>...
[15:30:23] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2818008 (jcrespo) The histograms are still not enough to convince T151356#2817971 to not optimize the `page_is_redirect` condition. My advi...
[16:03:23] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818068 (Marostegui) Papaul and myself have been having a chat and we are going to try a few things with this host. The first one, moved it to a different PDU. I am going to try to crash it now
[16:41:36] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818173 (Marostegui)
[16:44:42] DBA, Operations, ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818166 (Marostegui) This is correct, disk `32:4` has been set as offline and needs to be replaced: ``` RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0 Size : 3.271 TB...
[16:46:47] watch "mysql --skip-ssl -e \"SELECT greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000)/1000000 AS lag FROM heartbeat.heartbeat WHERE shard='s4' and datacenter='eqiad'\""
[16:47:01] shows 0 lag but the metrics still show spikes
[16:47:03] running it
[16:47:25] but 1 second ones, right?
[16:47:33] (not saying it is normal)
[16:47:50] it has +-0.5 measuring error
[16:54:28] it keeps showing spikes but that query shows 0 all the time (almost, never more than 1 sec)
[16:54:31] weird
[16:55:02] let's wait
[16:55:08] metrics have a 5 minute lag
[17:09:17] DBA, Datasets-General-or-Unknown, Labs, Labs-Infrastructure, Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2818230 (Marostegui) Script finished and now db1095 is trying to catch up with the master (I sto...
[17:09:26] DBA, Operations, Performance-Team, Availability, Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2277856 (fgiunchedi) re: certificate handling that @jcrespo mentioned, see also {T150822} for the related...
[17:37:32] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818317 (Marostegui) Good news, the first attempt was successful and the server DID NOT crash! I am going to try again just to make sure it was not coincidence or luck.
[17:38:47] yay
[17:39:32] :)
[18:47:56] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2818509 (jcrespo) a:jcrespo dbproxy1010 and dbproxy1011 are now serving as proxies for labsdb1009/10/11 on the labs-support network (they ju...
[18:48:05] ^ \o/
[18:51:25] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2818541 (dpatrick)
[18:52:01] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2798143 (dpatrick) @Bawolff, can you take a look at this?
[19:00:01] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818584 (Marostegui) The second attempt crashed the server (but it took a lot longer than usual) What Papaul and myself have agreed on as next steps is to change the PSU and see what happens. How...
[19:00:05] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2487045 (jcrespo)
[19:00:07] DBA, Labs, Labs-Infrastructure, Operations, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2818586 (jcrespo) Open>Resolved a:jcrespo>Cmjohnson The servers are wo...
[20:02:13] DBA, Labs, Labs-Infrastructure, Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2818714 (chasemp) How is the haproxy layer failed over (between nodes) in prod atm? LVS or ucarp/VRRP or ?
[20:13:52] DBA, Community-Tech, Labs, MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2818768 (Bawolff) a:chasemp This looks fine. +1 from security.
[20:57:45] volans: are you online?
[20:57:59] My connection died as I was trying to communicate with you.
[20:58:48] If not, just let me know if my cyberbot DB can be set to read only mode. I want to test some error handling of IABot, and the GUI interface I am writing.
[20:59:19] You can leave a message via PM.
[21:27:57] DBA, Operations, Tracking: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (Volans)
[23:55:26] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2819574 (hoo) There already is a separate table with redirect information (`redirect`), the `page_is_redirect` field is mostly for convenie...
[23:59:33] DBA, Beta-Cluster-Infrastructure, Continuous-Integration-Infrastructure, MediaWiki-Database, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#2819587 (Ejegg)