[00:34:56] jynus: There is no heartbeat_p database on wikireplica-{web,analytics}.eqiad.wmnet. Is this just a missing view issue or will the replag be tracked some other way on these hosts?
[07:33:00] bd808: it should be available now
[07:33:08] (heartbeat_p)
[08:43:26] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10aaron)
[08:43:32] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599604 (10jcrespo) Those were some false positives, checking the rest of the hosts now.
[08:44:57] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10aaron)
[08:49:27] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599616 (10jcrespo) I can help with this, but I will need supervision to understand the...
[09:14:30] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599667 (10Gilles)
[11:50:36] 10DBA, 10Operations, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3599905 (10jcrespo)
[11:50:38] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599903 (10jcrespo) 05Open>03Resolved I am now fairly confident that the main tables on most relevant servers are the same. I have not checked and fixed every table and every server, but it should be good enough to a...
[11:55:44] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3599907 (10jcrespo) p:05High>03Low 5 got rebuilt correctly, let's go with **Slot Number: 0** now (much lower priority). It has 12K errors.
[12:04:59] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[12:05:23] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599978 (10jcrespo)
[12:05:26] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599979 (10jcrespo)
[12:05:57] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[12:06:00] 10DBA, 10Operations, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3599981 (10jcrespo)
[13:14:54] 10DBA: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3600184 (10jcrespo)
[13:15:09] 10DBA: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3600198 (10jcrespo)
[13:15:11] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3600199 (10jcrespo)
[14:33:35] thanks jynus. I'm working on a lag display for the new replicas -- https://tools.wmflabs.org/replag/newdb.php
[14:33:55] cool
[14:34:07] is that live?
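[For context: the heartbeat_p views discussed above expose pt-heartbeat timestamps, and a lag display like the replag tool derives lag from them. A minimal sketch of that computation, assuming a hypothetical function name; this is illustration, not the actual replag or heartbeat_p code:]

```python
from datetime import datetime, timezone

def replication_lag_seconds(last_heartbeat: datetime, now: datetime) -> float:
    """Lag is the wall-clock distance between the master's most recent
    heartbeat timestamp (as replicated to the replica) and the current time."""
    return (now - last_heartbeat).total_seconds()

# Example: a heartbeat last written 2.5 seconds ago implies ~2.5s of lag.
beat = datetime(2017, 9, 13, 14, 0, 0, tzinfo=timezone.utc)
now = datetime(2017, 9, 13, 14, 0, 2, 500000, tzinfo=timezone.utc)
print(replication_lag_seconds(beat, now))  # 2.5
```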
[14:35:12] I wonder if I could now add decimal precision to the tables
[14:49:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3600623 (10Papaul) Disk replacement in slot 0 complete
[14:50:25] 10DBA, 10MediaWiki-extensions-Renameuser: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863#3600640 (10Legoktm)
[14:53:06] bd808: what do you think? https://gerrit.wikimedia.org/r/377480
[14:53:30] we deploy the non-workaround code on the new labs hosts only
[14:54:02] neat.
[14:54:19] now people can obsess over milliseconds of lag! ;)
[14:54:30] heads up in case decimals break integer arithmetic
[14:54:40] you can keep showing integers
[14:54:56] and flooring them (it has a +1, -1 second error)
[14:55:28] it is not really useful, but the underlying bug was quite nasty
[14:55:47] and we actually have plans to reduce to sub-second replication control
[14:57:02] I can see how that would make sense in the main cluster. For Data Services I think as long as we are within a few minutes most things should work fine.
[14:57:20] I don't like the idea of bots polling the DB for realtime things, honestly.
[14:57:32] we have better feeds for realtime data
[14:57:40] yes, the subsecond is for production
[14:57:53] it is just that they are all part of the same group
[14:58:02] we cannot do one without the other
[14:58:16] yeah
[14:58:42] in fact, for the analytics service I would like to see more lag in the future
[14:58:47] if that would help with performance
[14:59:01] e.g. think of being in sync once per day
[14:59:26] not sure if for labs, but for analytics/research hosts
[15:00:07] leaving web for things like "edit count"
[15:00:18] where in principle lag could have more impact
[15:00:49] *nod* there are so many use cases. It's hard to think of the best way to cover each one.
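[The flooring caveat jynus mentions can be illustrated: flooring a decimal lag under-reports a single reading by just under one second, and the difference between two floored readings can be off by up to a second either way. A sketch with a hypothetical helper, not the replag tool's actual code:]

```python
import math

def displayed_lag(lag_seconds: float) -> int:
    """Show lag as a whole number of seconds, as an integer-only display would."""
    return math.floor(lag_seconds)

# A single floored reading under-reports by up to (just under) 1 second:
assert displayed_lag(3.999) == 3

# And the *difference* of two floored readings can differ from the true
# difference by up to a second either way:
true_diff = 4.1 - 3.9                                  # ~0.2 s
shown_diff = displayed_lag(4.1) - displayed_lag(3.9)   # 4 - 3 = 1 s
print(true_diff, shown_diff)
```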
[15:01:02] yeah, these are only potential ideas
[15:01:07] nothing planned
[15:01:24] if we only had infinite time and hardware :)
[15:01:30] but I think if we do not replicate, but preload data, we could actually have more shards on a single host
[15:09:12] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3600731 (10jcrespo)
[15:10:43] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587259 (10jcrespo) Still on Firmware state: Rebuild, we will wait a bit for the next one. (I am a bit more cautious than I have to be due to the RAID 10 because the disks are not new, so there is a chance f...
[15:18:44] jynus: the heartbeat_p tables seem to have disappeared
[15:18:55] one sec
[15:19:32] https://gerrit.wikimedia.org/r/#/c/377484/
[15:27:15] bd808: check now
[15:28:49] I am not sure how useful having the full wiki list is
[15:28:57] IF you plan to add wikis
[15:29:03] I would add the non-s3 ones
[15:29:15] and summarize s3 as "other wikis"
[15:29:28] it can stay like that, much cleaner, I think
[15:30:52] If someone wants to know which wikis are where, we can link to https://noc.wikimedia.org/db.php or the meta_p documentation
[15:45:05] jynus: it's working again, thanks. And yeah, I want to rethink how the data is displayed
[16:04:23] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1059.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2017...
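[jynus's suggestion above (list the non-s3 wikis individually and collapse everything on s3 into a single "other wikis" row) could be sketched like this. The wiki-to-shard mapping here is a hypothetical hand-written sample; the real mapping lives in meta_p on the Wiki Replicas:]

```python
# Hypothetical sample of a wiki -> shard mapping (illustrative only; the
# authoritative data is in meta_p / noc.wikimedia.org/db.php).
wiki_shards = {
    "enwiki": "s1",
    "commonswiki": "s4",
    "wikidatawiki": "s5",
    "aawiki": "s3",
    "abwiki": "s3",
}

def summarize(wiki_shards):
    """List non-s3 wikis individually; collapse all s3 wikis into one
    'other wikis' row, as suggested in the discussion."""
    rows = sorted(w for w, s in wiki_shards.items() if s != "s3")
    if any(s == "s3" for s in wiki_shards.values()):
        rows.append("other wikis (s3)")
    return rows

print(summarize(wiki_shards))
# ['commonswiki', 'enwiki', 'wikidatawiki', 'other wikis (s3)']
```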
[16:07:31] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3600920 (10jcrespo)
[16:07:34] 10DBA, 10Patch-For-Review: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409#3600919 (10jcrespo) 05Open>03Resolved
[16:12:00] 10DBA, 10Operations, 10ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3600929 (10jcrespo) db1069 has been reused on s7, probably we should choose db1066 instead.
[16:26:45] 10DBA, 10CheckUser, 10Patch-For-Review: The "show ip" action should also provide a distinct list of user-agents for each IP - https://phabricator.wikimedia.org/T170508#3433923 (10jcrespo) See comment on gerrit, it helps with speeding up reviews :-).
[16:29:00] 10DBA, 10Wikimedia-Hackathon-2017, 10Wikimedia-Site-requests, 10Documentation, 10MediaWiki-SWAT-deployments: Create summary templates on Wikitech wiki to stop writing the same things everywhere, everytime - https://phabricator.wikimedia.org/T165756#3600983 (10jcrespo) p:05Normal>03Low I am not saying...
[16:39:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601037 (10jcrespo) 0 is Online, Spun UP. Next one should be **Span: 1**
[17:08:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601142 (10Papaul) Disk in slot 2 replacement complete.
[17:24:22] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3601199 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1059.eqiad.wmnet'] ``` and were **ALL** successful.
[17:28:38] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601239 (10Volans)
[22:13:39] any DBAs around?
https://gerrit.wikimedia.org/r/#/c/349457/ just went out with group0, and we ran the maintenance script to backfill the ip_changes table. All seems well, but we thought we'd check with you to see if everything is OK on your side?
[22:21:41] musikanimal: I can't see any obvious replag on https://dbtree.wikimedia.org/
[22:22:28] okay thanks :)