[00:34:56] jynus: There is no heartbeat_p database on wikireplica-{web,analytics}.eqiad.wmnet. Is this just a missing view issue or will the replag be tracked some other way on these hosts?
[07:33:00] bd808: it should be available now
[07:33:08] (heartbeat_p)
[08:43:26] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10aaron)
[08:43:32] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599604 (10jcrespo) Those were some false positives, checking the rest of the hosts now.
[08:44:57] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599592 (10aaron)
[08:49:27] 10DBA, 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599616 (10jcrespo) I can help with this, but I will need supervision to understand the...
[09:14:30] 10DBA, 10Operations, 10Availability (Multiple-active-datacenters), 10Performance-Team (Radar): Make client certs available for apache/maintenance hosts for TLS connections to mariadb - https://phabricator.wikimedia.org/T175672#3599667 (10Gilles)
[11:50:36] 10DBA, 10Operations, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3599905 (10jcrespo)
[11:50:38] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599903 (10jcrespo) 05Open>03Resolved I am now fairly confident that the main tables on most relevant servers are the same. I have not checked and fixed every table and every server, but it should be good enough to a...
[11:55:44] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3599907 (10jcrespo) p:05High>03Low 5 got rebuilt correctly, let's go with **Slot Number: 0** now (much lower priority). It has 12K errors.
[12:04:59] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[12:05:23] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599978 (10jcrespo)
[12:05:26] 10DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3599979 (10jcrespo)
[12:05:57] 10DBA, 10Phabricator: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10jcrespo)
[12:06:00] 10DBA, 10Operations, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3599981 (10jcrespo)
[13:14:54] 10DBA: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3600184 (10jcrespo)
[13:15:09] 10DBA: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3600198 (10jcrespo)
[13:15:11] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3600199 (10jcrespo)
[14:33:35] thanks jynus. I'm working on a lag display for the new replicas -- https://tools.wmflabs.org/replag/newdb.php
[14:33:55] cool
[14:34:07] is that live?
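[For context: the heartbeat_p views discussed above expose pt-heartbeat timestamps, and a lag display like the replag tool derives lag from them. A minimal sketch of that computation, assuming a hypothetical function name; this is illustration, not the actual replag or heartbeat_p code:]

```python
from datetime import datetime, timezone

def replication_lag_seconds(last_heartbeat: datetime, now: datetime) -> float:
    """Lag is the wall-clock distance between the master's most recent
    heartbeat timestamp (as replicated to the replica) and the current time."""
    return (now - last_heartbeat).total_seconds()

# Example: a heartbeat last written 2.5 seconds ago implies ~2.5s of lag.
beat = datetime(2017, 9, 13, 14, 0, 0, tzinfo=timezone.utc)
now = datetime(2017, 9, 13, 14, 0, 2, 500000, tzinfo=timezone.utc)
print(replication_lag_seconds(beat, now))  # 2.5
```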
[14:35:12] I wonder if I could now add decimal precision to the tables
[14:49:46] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3600623 (10Papaul) Disk replacement in slot 0 complete
[14:50:25] 10DBA, 10MediaWiki-extensions-Renameuser: Fix use of DB schema so RenameUser is trivial - https://phabricator.wikimedia.org/T33863#3600640 (10Legoktm)
[14:53:06] bd808: what do you think? https://gerrit.wikimedia.org/r/377480
[14:53:30] we deploy the non-workaround code on the new labs hosts only
[14:54:02] neat.
[14:54:19] now people can obsess over milliseconds of lag! ;)
[14:54:30] heads up in case decimals break integer arithmetic
[14:54:40] you can keep showing integers
[14:54:56] and flooring them (it has a +1, -1 second error)
[14:55:28] it is not really useful, but the underlying bug was quite nasty
[14:55:47] and we actually have plans to reduce to sub-second replication control
[14:57:02] I can see how that would make sense in the main cluster. For Data Services I think as long as we are within a few minutes most things should work fine.
[14:57:20] I don't like the idea of bots polling the DB for realtime things, honestly.
[14:57:32] we have better feeds for realtime data
[14:57:40] yes, the subsecond is for production
[14:57:53] it is just that they are all part of the same group
[14:58:02] we cannot do one without the other
[14:58:16] yeah
[14:58:42] in fact, for the analytics service I would like to see more lag in the future
[14:58:47] if that would help with performance
[14:59:01] e.g. think of being in sync once per day
[14:59:26] not sure if for labs, but for analytics/research hosts
[15:00:07] leaving web for things like "edit count"
[15:00:18] where in principle lag could have more impact
[15:00:49] *nod* there are so many use cases. It's hard to think of the best way to cover each one.
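[The flooring caveat jynus mentions can be illustrated: flooring a decimal lag under-reports a single reading by just under one second, and the difference between two floored readings can be off by up to a second either way. A sketch with a hypothetical helper, not the replag tool's actual code:]

```python
import math

def displayed_lag(lag_seconds: float) -> int:
    """Show lag as a whole number of seconds, as an integer-only display would."""
    return math.floor(lag_seconds)

# A single floored reading under-reports by up to (just under) 1 second:
assert displayed_lag(3.999) == 3

# And the *difference* of two floored readings can differ from the true
# difference by up to a second either way:
true_diff = 4.1 - 3.9                                  # ~0.2 s
shown_diff = displayed_lag(4.1) - displayed_lag(3.9)   # 4 - 3 = 1 s
print(true_diff, shown_diff)
```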
[15:01:02] yeah, these are only potential ideas
[15:01:07] nothing planned
[15:01:24] if we only had infinite time and hardware :)
[15:01:30] but I think if we do not replicate, but preload data, we could actually have more shards on a single host
[15:09:12] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3600731 (10jcrespo)
[15:10:43] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587259 (10jcrespo) Still on Firmware state: Rebuild, we will wait a bit for the next one. (I am a bit more cautious than I have to be due to the RAID 10 because the disks are not new, so there is a chance f...
[15:18:44] jynus: the heartbeat_p tables seem to have disappeared
[15:18:55] one sec
[15:19:32] https://gerrit.wikimedia.org/r/#/c/377484/
[15:27:15] bd808: check now
[15:28:49] I am not sure how useful having the full wiki list is
[15:28:57] IF you plan to add wikis
[15:29:03] I would add the non-s3 ones
[15:29:15] and summarize s3 as "other wikis"
[15:29:28] it can stay like that, much cleaner, I think
[15:30:52] If someone wants to know which wikis are where, we can link to https://noc.wikimedia.org/db.php or the meta_p documentation
[15:45:05] jynus: it's working again, thanks. And yeah, I want to rethink how the data is displayed
[16:04:23] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3599964 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1059.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/2017...
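[jynus's suggestion above (list the non-s3 wikis individually and collapse everything on s3 into a single "other wikis" row) could be sketched like this. The wiki-to-shard mapping here is a hypothetical hand-written sample; the real mapping lives in meta_p on the Wiki Replicas:]

```python
# Hypothetical sample of a wiki -> shard mapping (illustrative only; the
# authoritative data is in meta_p / noc.wikimedia.org/db.php).
wiki_shards = {
    "enwiki": "s1",
    "commonswiki": "s4",
    "wikidatawiki": "s5",
    "aawiki": "s3",
    "abwiki": "s3",
}

def summarize(wiki_shards):
    """List non-s3 wikis individually; collapse all s3 wikis into one
    'other wikis' row, as suggested in the discussion."""
    rows = sorted(w for w, s in wiki_shards.items() if s != "s3")
    if any(s == "s3" for s in wiki_shards.values()):
        rows.append("other wikis (s3)")
    return rows

print(summarize(wiki_shards))
# ['commonswiki', 'enwiki', 'wikidatawiki', 'other wikis (s3)']
```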
[16:07:31] 10DBA, 10Epic: Meta ticket: Migrate multi-source database hosts to multi-instance - https://phabricator.wikimedia.org/T159423#3600920 (10jcrespo)
[16:07:34] 10DBA, 10Patch-For-Review: Migrate dbstore2001 to multi instance - https://phabricator.wikimedia.org/T168409#3600919 (10jcrespo) 05Open>03Resolved
[16:12:00] 10DBA, 10Operations, 10ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3600929 (10jcrespo) db1069 has been reused on s7, probably we should choose db1066 instead.
[16:26:45] 10DBA, 10CheckUser, 10Patch-For-Review: The "show ip" action should also provide a distinct list of user-agents for each IP - https://phabricator.wikimedia.org/T170508#3433923 (10jcrespo) See comment on gerrit, it helps with speeding up reviews :-).
[16:29:00] 10DBA, 10Wikimedia-Hackathon-2017, 10Wikimedia-Site-requests, 10Documentation, 10MediaWiki-SWAT-deployments: Create summary templates on Wikitech wiki to stop writing the same things everywhere, everytime - https://phabricator.wikimedia.org/T165756#3600983 (10jcrespo) p:05Normal>03Low I am not saying...
[16:39:24] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601037 (10jcrespo) 0 is Online, Spun UP. Next one should be **Span: 1**
[17:08:54] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601142 (10Papaul) Disk in slot 2 replacement complete.
[17:24:22] 10DBA, 10Phabricator, 10Patch-For-Review: Move m3 slave to db1059 - https://phabricator.wikimedia.org/T175679#3601199 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1059.eqiad.wmnet'] ``` and were **ALL** successful.
[17:28:38] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3601239 (10Volans)
[22:13:39] any DBAs around?
https://gerrit.wikimedia.org/r/#/c/349457/ just went out with group0, and we ran the maintenance script to backfill the ip_changes table. All seems well, but we thought we'd check with you to see if everything is OK on your side?
[22:21:41] musikanimal: I can't see any obvious replag on https://dbtree.wikimedia.org/
[22:22:28] okay thanks :)