[07:39:49] <wikibugs_>	 10DBA, 10Phabricator, 07Upstream: Editing a recurring event overrides all past instances - https://phabricator.wikimedia.org/T151228#2817113 (10mmodell) So from upstream, it seems that phabricator is just behaving the same way as other calendar apps and the language about "future events" is just potentially...
[07:40:43] <wikibugs>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817114 (10Marostegui) As we spoke yesterday.  I am using db1052 (which was depooled yesterday) to import S1's tablesspace to db109...
[07:51:24] <wikibugs_>	 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2817119 (10Marostegui) Thanks - doing it now.
[07:55:23] <wikibugs>	 10DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817124 (10Marostegui) Given that m3 isn't big (100G) I can either reimport that table or the whole tablespace from db2012
[09:03:02] <wikibugs_>	 10DBA, 06Operations: Review Icinga alarms with disabled notifications - https://phabricator.wikimedia.org/T149643#2817177 (10Volans) p:05Triage>03Normal
[09:45:20] <wikibugs>	 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2817207 (10Marostegui) Hey @Papaul   The server crashed again - same symptoms:  First attempt:  ``` iLO Advanced 2.40 at  Dec 02 2015 Server Name: WIN-12 Server Power: Off ```  ``` </system1/log1>hp...
[09:49:39] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817211 (10jcrespo) dbstore1001 doesn't use GTID, and it is a delayed slave that starts replication automatically, so it is not sim...
[09:51:04] <marostegui>	 ^ jynus then we probably need to use another slave to test row...
[09:51:31] <jynus>	 why?
[09:52:38] <marostegui>	 that or we migrate dbstore1001 to the master, or you are confident that if we change the binlog format to row dbstore1001 won't break?
[09:52:46] <wikibugs>	 10DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817218 (10jcrespo) We should stop the slaves in sync and import at least the tables on phabricator_conduit.
[09:53:22] <jynus>	 why do you think dbstore1001 will break in row based replication?
[09:54:01] <marostegui>	 I don't know if it will break, but if the data isn't the same it might, that is why I am asking you :)
[09:54:29] <wikibugs_>	 10DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817219 (10Marostegui) That also works. We can stop db1048's replication for a few seconds let the slaves reach the same position, stop them, a...
[09:54:32] <jynus>	 if data isn't the same we would want to know, given that it is our backup system
[09:55:07] <jynus>	 but we have to move it at some point to the final master, don't you think?
[09:55:14] <marostegui>	 of course
[10:05:00] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817243 (10hoo) >>! In T151356#2815197, @jcrespo wrote: > This is the goal, but we will try to achieve this without the force index, dependin...
[10:15:28] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817271 (10jcrespo) The problem is not how to do the force, that can be added directly, the problem is that force index is a poor workaround,...
[10:18:24] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817274 (10jcrespo) a:03jcrespo I will move the reminder dbstore1001 replication channels to the right master, hopefully not brea...
[10:18:40] <wikibugs>	 10DBA: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817276 (10Marostegui) a:03Marostegui
[10:23:58] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817294 (10hoo) p:05Triage>03High
[10:31:08] <marostegui>	 jynus: is it fine to kill pt-heartbeat on db1048 for a few seconds so it doesn't write to the binlog and I can stop the slaves at the same position?
[10:31:39] <jynus>	 go to the screen and see if it has logged any query
[10:32:28] <marostegui>	 which screen?
[10:32:48] <jynus>	 how many screens does db1048 have?
[10:32:56] <marostegui>	 0
[10:33:10] <marostegui>	 root@db1048:~# screen -ls
[10:33:11] <marostegui>	 No Sockets found in /var/run/screen/S-root.
[10:33:11] <marostegui>	 root@db1048:~#
[10:33:35] <jynus>	 pt-heartbeat?
[10:33:50] <jynus>	 it should not be running there
[10:33:56] <jynus>	 it should be running on the master
[10:34:04] <jynus>	 check the puppetization
[10:34:09] <marostegui>	 oki
[10:34:15] <marostegui>	 maybe that was an old master
[10:34:16] <marostegui>	 I will check
[10:38:44] <marostegui>	 it is marked as master whereas db1043 isn't
[10:38:53] <marostegui>	 according to tendril db1043 is the master of m3
[10:42:35] <wikibugs_>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817362 (10daniel) So we have the choice of forcing a bad query plan ourselves, or leaving it to Maria to pick a bad query plan...
[10:45:58] <marostegui>	 db1043 isn't running the pt-heartbeat but I do believe it is the master as the processlist shows connections from other hosts
[10:47:07] <marostegui>	 the dbproxy configuration shows the right master (db1043)
[10:50:33] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817404 (10jcrespo) > So we have the choice of forcing a bad query plan ourselves, or leaving it to Maria to pick a bad query plan...  No, wh...
[11:00:24] <wikibugs_>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817425 (10ArielGlenn)
[11:51:09] <wikibugs_>	 10DBA, 13Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817496 (10Marostegui) The table `phabricator_conduit.conduit_methodcalllog` has been reimported from db2012 to dbstore10...
[11:55:01] <wikibugs_>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817499 (10jcrespo) @ArielGlenn don't add yourself to this ticket, as it will be closed soon. Chec...
[12:10:46] <wikibugs>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2817556 (10ArielGlenn)
[12:33:46] <wikibugs>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817590 (10jcrespo) a:05jcrespo>03None db1052 and the others should be clear to be used.
[13:10:29] <wikibugs_>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817656 (10Marostegui) All done, db1052 is now master of db1095 which is replicating ROW-based rep...
[13:26:58] <wikibugs_>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817685 (10jcrespo) __wmf_checksums can be dropped at any time, it is the table I use for running...
[14:13:11] <wikibugs>	 10DBA, 13Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817785 (10jcrespo) 05Open>03Resolved Let's resolve, reopen if something else happens. dbstore1002 is not at all a pr...
[14:16:21] <wikibugs>	 10DBA, 13Patch-For-Review: phabricator_conduit.conduit_methodcalllog failed replicating on dbstore1002, probably m3 needs a reload on that server - https://phabricator.wikimedia.org/T151384#2817789 (10Marostegui) Sounds good. I will leave the table there and add a note in my calendar to drop it in a couple of...
[14:23:25] <wikibugs>	 10DBA: Media errors on db1048 are creating lag - https://phabricator.wikimedia.org/T151039#2817801 (10jcrespo) p:05Triage>03Low This doesn't really look like disk issues. Anyway, the initial problem (lag) is gone.
[14:26:52] <jynus>	 as db1092 is depooled, I will use it to experiment for https://phabricator.wikimedia.org/T151356
[14:27:50] <marostegui>	 Sounds good
[14:27:57] <marostegui>	 Let's see if it crashes again with some activity
[14:28:25] <jynus>	 yes
[14:28:35] <jynus>	 I will upgrade it now that it is depooled
[14:28:46] <jynus>	 and reboot it once for the new kernel
[14:28:57] <jynus>	 and play (safely) with it
[14:29:07] <jynus>	 I assume you are not doing anything with it
[14:29:18] <jynus>	 so this was just a heads up
[14:29:18] <marostegui>	 nop
[14:29:29] <marostegui>	 it is all yours
[14:29:42] <marostegui>	 I will pool it back on Monday most likely if it goes fine for the next few days
[14:29:54] <jynus>	 I may be too annoying
[14:30:01] <marostegui>	 why? 
[14:30:08] <jynus>	 but 2 dbas on the same machine is a recipe for disaster
[14:30:13] <marostegui>	 ah no
[14:30:27] <marostegui>	 It is better to ask before destroying others' work!
[14:30:27] <marostegui>	 :)
[14:30:28] <jynus>	 so I give a heads up when that
[14:30:31] <jynus>	 exactly
[14:30:39] <jynus>	 consider however irc as asynchronous
[14:30:45] <jynus>	 unless I ping you
[14:31:14] <marostegui>	 ok :)
[14:31:17] <marostegui>	 still working on that :p
[14:37:32] <wikibugs>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2817856 (10Marostegui) >>! In T150960#2817685, @jcrespo wrote: > __wmf_checksums can be dropped at...
[14:53:01] <wikibugs_>	 10DBA, 06Operations, 10ops-codfw, 13Patch-For-Review: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2817901 (10Marostegui) This has been running fine for 6 days already. Once the deploys are un blocked, I will pool it back.
[15:17:09] <jynus>	 I am enabling histogram creation on db1092 as a test for a very common query optimization problem we suffer
[15:19:09] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2817971 (10jcrespo) If we have to fail back to index hinting, this should be preferred -ignore rather than force: ``` MariaDB [wikidatawiki]>...
[15:30:23] <wikibugs>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2818008 (10jcrespo) The histograms are still not enough to convince T151356#2817971 to not optimize the `page_is_redirect` condition. My advi...
[16:03:23] <wikibugs>	 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818068 (10Marostegui) Papaul and myself have been having a chat and we are going to try a few things with this host. The first one, moved it to a different PDU. I am going to try to crash it now
[16:41:36] <wikibugs_>	 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818173 (10Marostegui)
[16:44:42] <wikibugs_>	 10DBA, 06Operations, 10ops-eqiad: Degraded RAID on db1053 - https://phabricator.wikimedia.org/T151465#2818166 (10Marostegui) This is correct, disk `32:4` has been set as offline and needs to be replaced:  ``` RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0 Size                : 3.271 TB...
[16:46:47] <jynus>	 watch "mysql --skip-ssl -e \"SELECT greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - 500000)/1000000 AS lag FROM heartbeat.heartbeat WHERE shard='s4' and datacenter='eqiad'\""
[16:47:01] <jynus>	 shows 0 lag but the metrics still show spikes
[16:47:03] <marostegui>	 running it
[16:47:25] <marostegui>	 but 1 seconds one, right?
[16:47:33] <marostegui>	 (not saying it is normal)
[16:47:50] <jynus>	 it has +-0.5 measuring error
[16:54:28] <marostegui>	 it keeps showing spikes but that query shows 0 all the time (almost, never more than 1 sec)
[16:54:31] <marostegui>	 weird
[16:55:02] <jynus>	 let's wait
[16:55:08] <jynus>	 metrics have a 5 minute lag
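For readers following the lag query jynus pasted above, here is a minimal Python sketch of the same arithmetic. It assumes the 500000 microsecond subtraction is there to absorb the "+-0.5 measuring error" he mentions; the log does not state that outright. The function name `heartbeat_lag_seconds` and the sample timestamps are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def heartbeat_lag_seconds(heartbeat_ts, now, offset_us=500_000):
    """Mirror greatest(0, TIMESTAMPDIFF(MICROSECOND, ts, now) - offset) / 1000000.

    heartbeat_ts: last timestamp written to the heartbeat table (UTC)
    now:          current UTC time on the replica
    offset_us:    microseconds subtracted before clamping (500000 in the query)
    """
    # Whole microseconds elapsed since the last heartbeat row was written.
    diff_us = (now - heartbeat_ts) // timedelta(microseconds=1)
    # Subtract the offset, clamp at zero, and convert to seconds.
    return max(0, diff_us - offset_us) / 1_000_000


# Made-up timestamps, not taken from the log.
now = datetime(2020, 1, 1, 12, 0, 0)
print(heartbeat_lag_seconds(now - timedelta(milliseconds=300), now))  # under the offset
print(heartbeat_lag_seconds(now - timedelta(seconds=2), now))         # two seconds behind
```

With this arithmetic any heartbeat younger than half a second reports as zero, which fits the query showing 0 almost all the time while a separately collected metric may still show spikes.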
[17:09:17] <wikibugs_>	 10DBA, 10Datasets-General-or-Unknown, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Initial data tests for db1095 (temporary db1069 - sanitarium replacement) - https://phabricator.wikimedia.org/T150960#2818230 (10Marostegui) Script finished and now db1095 is trying to catch up with the master (I sto...
[17:09:26] <wikibugs_>	 10DBA, 06Operations, 06Performance-Team, 07Availability, 07Wikimedia-Multiple-active-datacenters: Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809#2277856 (10fgiunchedi) re: certificate handling that @jcrespo mentioned, see also {T150822} for the related...
[17:37:32] <wikibugs_>	 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818317 (10Marostegui) Good news, the first attempt was successful and the server DID NOT crash!  I am going to try again just to make sure it was not coincidence or luck.
[17:38:47] <jynus>	 yay
[17:39:32] <marostegui>	 :)
[18:47:56] <wikibugs>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2818509 (10jcrespo) a:03jcrespo dbproxy1010 and dbproxy1011 are now serving as proxies for labsdb1009/10/11 on the labs-support network (they ju...
[18:48:05] <marostegui>	 ^ \o/
[18:51:25] <wikibugs>	 10DBA, 06Community-Tech, 06Labs, 10MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2818541 (10dpatrick)
[18:52:01] <wikibugs_>	 10DBA, 06Community-Tech, 06Labs, 10MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2798143 (10dpatrick) @Bawolff, can you take at this?
[19:00:01] <wikibugs>	 10DBA, 13Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2818584 (10Marostegui) The second attempt crashed the server (but it took a lot longer than usual)  What Papaul and myself have agreed on as next steps is to change the PSU and see what happens. How...
[19:00:05] <wikibugs>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2487045 (10jcrespo)
[19:00:07] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 06Operations, and 3 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2818586 (10jcrespo) 05Open>03Resolved a:05jcrespo>03Cmjohnson The servers are wo...
[20:02:13] <wikibugs_>	 10DBA, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Implement a frontend failover solution for labsdb replicas - https://phabricator.wikimedia.org/T141097#2818714 (10chasemp) How is the haproxy layer failed over (between nodes) in prod atm?  LVS or ucarp/VRRP or ?
[20:13:52] <wikibugs_>	 10DBA, 06Community-Tech, 06Labs, 10MediaWiki-extensions-PageAssessments, and 2 others: Replicate page_assessments and page_assessments_projects tables on Labs - https://phabricator.wikimedia.org/T150832#2818768 (10Bawolff) a:03chasemp This looks fine. +1 from security.
[20:57:45] <Cyberpower678>	 volans: are you online?
[20:57:59] <Cyberpower678>	 My connection died as I was trying to communicate with you.
[20:58:48] <Cyberpower678>	 If not, just let me know if my cyberbot DB can be set to read only mode.  I want to test some error handling of IABot, and the GUI interface I am writing.
[20:59:19] <Cyberpower678>	 You can leave a message via PM.
[21:27:57] <wikibugs_>	 10DBA, 06Operations, 07Tracking: Icinga MariaDB disk space check on silver checks the wrong partition - https://phabricator.wikimedia.org/T151491#2819072 (10Volans)
[23:55:26] <wikibugs_>	 10DBA, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata: Wikibase\Repo\Store\Sql\SqlEntityIdPager::fetchIds query slow - https://phabricator.wikimedia.org/T151356#2819574 (10hoo) There already is a separate table with redirect information (`redirect`), the `page_is_redirect` field is mostly for convenie...
[23:59:33] <wikibugs>	 10DBA, 10Beta-Cluster-Infrastructure, 10Continuous-Integration-Infrastructure, 10MediaWiki-Database, and 2 others: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#2819587 (10Ejegg)