[01:09:00] Blocked-on-schema-change, DBA, Schema-change: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2683366 (Legoktm)
[06:34:29] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683543 (Marostegui) S3 tables have been converted to InnoDB.
[06:38:34] DBA, Operations, ops-eqiad: db1055: degraded array - https://phabricator.wikimedia.org/T147172#2683544 (Marostegui)
[06:43:12] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683563 (Marostegui) S4 (commonswiki database) table has been converted
[06:47:23] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683567 (Marostegui) S5 (dewiki and wikidatawiki) table converted
[06:50:57] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683573 (Marostegui) S6 (frwiki, jawiki, ruwiki) converted
[06:53:14] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683575 (Marostegui) S7 tables converted
[06:56:29] DBA, Labs, Labs-Infrastructure, Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2683577 (Marostegui)
[06:56:31] DBA, Labs, Labs-Infrastructure, Upstream: db1069: convert user_groups table to InnoDB across all the wikis - https://phabricator.wikimedia.org/T146121#2683576 (Marostegui) Open→Resolved
[07:41:35] marostegui, do you want to own T147113 or should I do it?
[07:41:49] jynus: I'll do it
[07:41:53] let me review the plan with you
[07:42:07] no need, I know you will be doing ok - depool, apply
[07:42:15] it may take some hours, though
[07:42:19] yeah
[07:42:26] that was what I would do :)
[07:42:34] unbreak now, but it has been like that for 6 months
[07:42:40] take that into account
[07:42:47] but let's fix it ASAP
[07:42:53] there are a few slaves without the PK (and the master)
[07:43:00] so I will start now with the first slave (db1081)
[07:43:01] commons master?
[07:43:04] yes
[07:43:11] those are 90% of the traffic
[07:43:16] yep
[07:43:17] so that is the reason
[07:43:21] I see it in tendril
[07:43:35] for some reason, a primary key was never added to commons
[07:43:43] We will fix it now then :)
[07:44:37] I would say that after the labs goal, we should set T104459 as UBN
[07:44:37] T104459: Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459
[08:00:27] jynus: just to have another pair of eyes, does this look good to you?: #/software/dbtools/osc_host.sh --host=db1081.eqiad.wmnet --port=3306 --db=commonswiki --table=revision --method=ddl --no-replicate "ADD PRIMARY KEY (rev_id,rev_user);";
[08:01:30] if it is depooled, you can even run it directly
[08:01:58] downtime the replication, just in case
[08:02:03] (the alert)
[08:02:06] good idea
[08:02:11] wait
[08:02:15] about the alter
[08:02:26] tell me
[08:02:45] shouldn't it just be converting the existing unique key into a PK?
[08:03:47] the unique key only has the rev_id column
[08:05:08] mmm
[08:05:14] let me check something
[08:05:32] as the PK on one of the codfw hosts is PRIMARY KEY (`rev_page`,`rev_id`),
[08:05:54] mmm
[08:05:57] check tables.sql
[08:06:07] I think it should be rev_id only?
[08:06:19] let's check mediawiki
[08:06:25] and the other hosts
[08:07:13] there are different PKs across the hosts; most of them have (rev_page, rev_id), but some others have (rev_id, rev_user)
[08:08:59] jynus: https://phabricator.wikimedia.org/P4143
[08:09:03] rev_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT
[08:09:10] according to tables.sql
[08:09:24] I know we have rev_id, rev_user on the partitioned tables
[08:09:29] on purpose
[08:09:32] ah ok
[08:10:17] wow
[08:10:41] this is https://phabricator.wikimedia.org/T132416
[08:11:38] the problem now is how many of those are mistakes/old things
[08:11:50] and which ones were changed on purpose
[08:12:00] by someone who did not update mediawiki's indexes
[08:12:48] "6 servers (the non-contributions eqiad slaves: db1053, db1057, db1065, db1066, db1072, db1073) have PRIMARY KEY(rev_id). This is the sane thing to do, and what MediaWiki's tables.sql prescribes."
[08:12:55] ^I would go for that
[08:13:06] if it is wrong, it is mediawiki's fault
[08:13:22] someone clearly forked mediawiki code at some point
[08:13:27] but did not document it properly
[08:13:39] jynus: weren't some of the wrong ones on purpose, because of wrong query plans on API servers?
[08:14:01] jynus: I see, that looks good to me
[08:14:06] with a very broad definition of "on purpose" ;)
[08:14:48] jynus: Then we can just drop the UNIQUE key and in the same transaction add the PK
[08:15:57] yes, that was my initial idea
[08:16:07] I am not 100% convinced of it
[08:16:25] but if we break things, let's break them while taking a step forward
[08:16:28] not backwards
[08:17:11] maybe the older index has to be kept?
[08:17:52] no, clearly just unique -> PK
[08:18:07] I believe what happened is that someone thought UNIQUE == PK
[08:18:22] which may be true on the InnoDB side
[08:18:33] but not at the SQL layer
[08:18:46] I know more than one person who has made that mistake
[08:18:54] it is either that
[08:19:05] or a schema change was rolled out everywhere but the masters
[08:19:09] and then it propagated
[08:19:38] so let's go for the "DROP rev_id, ADD PRIMARY KEY (rev_id)"
[08:19:45] marostegui, agree?
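A minimal sketch of the conversion just agreed on, expressed as plain SQL through the mysql client on the depooled slave; the host, database and index names are the ones in the log, and in practice the change was wrapped in osc_host.sh, as in the final command a few messages below (credentials omitted).

```
# Inspect the current keys on the depooled slave
# (expected: a UNIQUE KEY named rev_id and no PRIMARY KEY).
mysql -h db1081.eqiad.wmnet commonswiki -e "SHOW CREATE TABLE revision\G"

# Convert the unique key into the primary key in a single DDL statement,
# matching MediaWiki's tables.sql (rev_id is the AUTO_INCREMENT column).
mysql -h db1081.eqiad.wmnet commonswiki -e \
  "ALTER TABLE revision DROP INDEX rev_id, ADD PRIMARY KEY (rev_id)"
```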
[08:19:54] Yeah, I was testing one thing, give me a sec
[08:19:58] with a vanilla table
[08:22:50] jynus: ok, looks fine, so this would be the final command: #/software/dbtools/osc_host.sh --host=db1081.eqiad.wmnet --port=3306 --db=commonswiki --table=revision --method=ddl --no-replicate "DROP index rev_id, ADD PRIMARY KEY (rev_id);";
[08:22:58] +1
[08:23:11] the box is depooled (no connections there) so we are good to go
[08:23:17] I will update the ticket with the findings
[08:37:41] Blocked-on-schema-change, DBA, Schema-change: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2683666 (jcrespo) Thank you, @Legoktm to take the time to file the ticket. Thanks to that, this will not be missed and will be applie...
[08:38:01] Blocked-on-schema-change, DBA: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166#2683667 (jcrespo)
[08:53:18] I see the alter has been running for 30 minutes
[08:53:38] Mine or yours?
[08:53:52] mine is still running, yes
[08:54:03] and that is the only one on the activity page :)
[08:54:10] I updated the deployments page, by the way
[09:09:07] good, thank you for that
[09:09:13] you are the captain this week :-)
[09:09:58] http://67.media.tumblr.com/9ba8fe087f2bc0eda23f424bffb20fb0/tumblr_inline_odd3dlLIY31raprkq_500.gif - you looking at me this week
[09:10:41] rotfl
[09:18:52] db1081 was using single-tablespace :-/
[09:19:08] jynus: yes :_(
[09:19:10] the temporary table is now 47 GB in size
[09:20:13] logical size seems to be around 70 GB
[09:20:35] so 60% done in 1 hour
[09:21:23] probably you will be able to get this done today or tomorrow
[09:21:39] I will take care of the pending phabricator search issues this afternoon
[09:23:16] thanks
[09:30:23] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2655213 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20161...
[09:35:53] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2683728 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbstore2001.codfw.wmnet'] ``` Those hosts were successful: ``` [] ```
[09:38:32] "Those hosts were successful: []" :-/
[09:38:50] Yeah
[09:38:55] I am troubleshooting with elukey
[09:39:34] actually, I see puppet running ok
[09:39:53] maybe it was aborted, because I wasn't asked for a new key on ssh
[09:40:25] yep: "up 285 days"
[09:41:53] yeah, that is the first thing I checked XD
[09:41:57] it seems that it failed to downtime icinga?
[09:42:23] marostegui: was it already in downtime?
[09:42:25] db1081 looks ok
[09:42:31] it finished
[09:42:41] the index is there
[09:42:53] Ah nice! I will pool it
[09:42:56] volans: yes
[09:43:00] there is an open bug unfortunately: T145192
[09:43:01] T145192: icinga-downtime script waiting forever if host already in downtime - https://phabricator.wikimedia.org/T145192
[09:43:05] * elukey blames Riccardo
[09:43:12] :P
[09:43:20] ehehehe
[09:43:21] marostegui, pool it with low weight
[09:43:26] yep
[09:44:24] jynus: 100 sounds good?
[09:44:34] volans: so should I remove the downtime?
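A hypothetical version of the "the index is there" check from 09:42, done before repooling an altered host. The query is standard information_schema; the host list is illustrative, taken from hosts mentioned in the log.

```
# Sketch only: confirm the altered hosts now have the expected PRIMARY KEY on
# commonswiki.revision before repooling (host names taken from the log, credentials omitted).
for host in db1081.eqiad.wmnet db1084.eqiad.wmnet db1091.eqiad.wmnet; do
  echo "== $host =="
  mysql -h "$host" -e "
    SELECT index_name, GROUP_CONCAT(column_name ORDER BY seq_in_index) AS cols
      FROM information_schema.statistics
     WHERE table_schema = 'commonswiki'
       AND table_name   = 'revision'
       AND index_name   = 'PRIMARY'
     GROUP BY index_name"
done
```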
[09:44:51] marostegui: the quick fix is to remove the downtime from icinga (downtime page on the left); the host downtime should be sufficient
[09:44:57] I can see there are 3
[09:45:07] you can checkbox them all and delete from the top-right
[09:45:27] and relaunch the reimage
[09:45:50] "the host downtime should be sufficient" <-- you can blame pages on him if they fail :-)
[09:45:59] haha
[09:46:04] you have to be quick ;)
[09:46:11] I have removed them
[09:46:12] do it when you are already at the IPMI prompt
[09:46:17] going to relaunch it again then :)
[09:46:50] volans, we will try to reload 12 TB of data in a couple of hours, cannot guarantee anything
[09:47:07] jynus: what do you mean?
[09:47:15] when the downtime expires
[09:47:34] the servers (their services) will likely not be in a good state, sadly
[09:48:03] I said quick in case it was already down right now; I set a 4h downtime for reimages right now
[09:48:15] in the script
[09:48:20] jynus: Going to add db1081 back with 100 weight
[09:48:23] but you can manually add another, longer one if needed
[09:48:29] yes, that should be enough
[09:48:31] after the reimage starts
[09:48:38] we just to remember that
[09:48:41] *need
[09:49:12] also icinga-downtime sets the downtime only for the host, not the services... I think it should do both
[09:49:38] yes, although I remember having such a discussion some time ago with someone else
[09:50:04] I think that was the original intention of that script
[09:50:19] (not yours, the downtime one)
[09:51:09] doing only hosts or both?
[09:51:23] both the host and all the services
[09:51:27] jynus: https://gerrit.wikimedia.org/r/#/c/313787/ ?
[09:51:47] marostegui, go on
[09:51:58] monitor kibana carefully
[09:52:05] I never had alarms fired using icinga-downtime, weird
[09:52:05] yep
[09:52:08] filtering on that server's IPs
[09:52:20] elukey, services depend on the host
[09:52:24] elukey: if you reboot a host it is ok
[09:52:37] but if you do icinga-downtime and then just stop a service it should alarm
[09:52:39] however, if the host is up but services fail
[09:52:40] as normal
[09:52:47] ^exactly
[09:52:50] ah, didn't know that
[09:52:52] thanks :)
[09:53:06] which is more typical than usual with mysql, because provisioning usually takes hours
[09:53:21] while most other services are ready a few minutes after being up
[09:53:37] it is the difference between using SCHEDULE_HOST_DOWNTIME (current) and SCHEDULE_HOST_SVC_DOWNTIME
[09:54:21] probably we should use operations for this conversation, the last thing I want is to be called out on "yet another channel to monitor"
[09:54:43] true
[09:55:11] let's stick to this channel for pure db work/coordination
[10:04:02] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2683790 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['dbstore2001.codfw.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20161...
[10:31:55] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2683871 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbstore2001.codfw.wmnet'] ``` Those hosts were successful: ``` ['dbstore2001.codfw.wmnet'] ```
[10:34:38] DBA, Patch-For-Review: Reimage dbstore2001 as jessie - https://phabricator.wikimedia.org/T146261#2683874 (Marostegui) So this host is now running Jessie ``` marostegui@dbstore2001:~$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 8.6 (jessie) Release...
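The SCHEDULE_HOST_DOWNTIME vs. SCHEDULE_HOST_SVC_DOWNTIME distinction at 09:53 maps to Icinga's external command interface; a rough illustration follows. The command-file path, author and comment are placeholders, not taken from the icinga-downtime script itself.

```
# Rough illustration of the two external commands mentioned above.
CMDFILE=/var/lib/icinga/rw/icinga.cmd        # hypothetical command-file path
NOW=$(date +%s); END=$((NOW + 4*3600))       # a fixed 4h window, as the reimage script sets

# Host-only downtime (what icinga-downtime was doing at the time):
printf '[%d] SCHEDULE_HOST_DOWNTIME;dbstore2001;%d;%d;1;0;14400;marostegui;reimage\n' \
  "$NOW" "$NOW" "$END" > "$CMDFILE"

# Host plus all of its services (the behaviour being suggested):
printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;dbstore2001;%d;%d;1;0;14400;marostegui;reimage\n' \
  "$NOW" "$NOW" "$END" > "$CMDFILE"
```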
[11:37:37] DBA, Operations: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533#2684019 (Marostegui) @Cmjohnson let me know if you want to proceed with this upgrade sometime this week? This server needs to be depooled first.
[14:55:37] jynus: https://phabricator.wikimedia.org/P4146
[14:56:34] not sure how easy the master would be; revision changes were always a pain
[14:56:49] you cannot disable the binary log
[14:56:53] Yeah, I was hesitant, wasn't going to do it without talking to you first :)
[14:56:54] for percona
[14:57:20] but ddl should work, though it is a ddl, with all that implies
[14:57:28] Yeah
[14:57:37] probably better to do it one day when we are both in
[14:57:39] let me do the final ones
[14:57:46] sure thing
[14:57:47] the slave
[14:57:49] only one pending then
[14:57:50] yep
[14:58:00] and we can talk about that, etc.
[14:58:06] I am increasing the weight of db1084 now, I just pushed it to 300
[14:58:08] that should finish
[14:58:17] the issue itself
[14:58:22] which was the main problem
[14:58:37] yeah :)
[14:58:39] thanks a lot for the link
[14:59:31] my pleasure! :)
[15:02:10] marostegui, I forgot to mention I saw some api queries creating connection issues
[15:02:24] I put a watchdog killing long-running queries on a couple of api servers
[15:03:36] Ah, yes, I read that
[15:03:36] I haven't seen bad patterns since then
[15:07:44] oops, I think I made you reinstall dbstore2001 twice
[15:07:47] sorry
[15:07:59] what do you mean?
[15:08:11] I didn't remember to mention the dhcp change
[15:08:29] but it is ok now
[15:08:38] also thanks for the sanitarium alter
[15:08:49] you are literally right now more prolific than I am
[15:08:50] I removed the dhcp lines as per my chat with elukey, if that is what you mean: https://gerrit.wikimedia.org/r/#/c/313780/
[15:09:13] yes, I forgot to tell you to do that
[15:09:16] sorry
[15:09:20] No worries :)
[15:09:35] it should have taken you only 5-10 minutes
[15:09:40] of unattended work
[15:09:45] I hope
[15:09:48] yeah
[15:09:56] we only had to debug why it didn't work
[15:10:01] and it was because of the icinga downtime :)
[15:10:03] and now you will always remember
[15:10:04] :-)
[15:10:07] haha indeed
[15:10:09] that is strange
[15:10:15] but good to know
[15:15:50] jynus: I just pushed the change to get db1084 back to weight 500
[15:15:54] So it is all good with that one
[15:16:02] I am going to go to the swimming pool now :)
[15:16:10] ok
[15:16:16] Have a great evening, if you need something ping me on Hangouts or telegram or call me or whatever :)
[15:16:23] it is ok because I am here
[15:16:44] Cool! Thanks :)
[15:16:47] bye
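The log does not say which watchdog was put on the API servers at 15:02; the general idea could be sketched with pt-kill from Percona Toolkit. Host, user and thresholds below are made up for illustration only.

```
# Sketch only: the watchdog mentioned at 15:02 is not named in the log.
# This shows the idea with pt-kill; host, user and thresholds are illustrative.
pt-kill h=dbNNNN.eqiad.wmnet \
  --match-command Query \
  --match-user wikiuser \
  --busy-time 60 \
  --victims all \
  --interval 30 \
  --print --kill
```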
[15:26:20] DBA, MediaWiki-Database, Patch-For-Review, Schema-change: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441#2684619 (jcrespo)
[16:34:42] DBA, MediaWiki-extensions-ORES, Revision-Scoring-As-A-Service, User-Ladsgroup: Ensure ORES data violating constraints do not affect production - https://phabricator.wikimedia.org/T145356#2684844 (Halfak)
[16:38:45] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team, Patch-For-Review, WorkType-Maintenance: Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10 - https://phabricator.wikimedia.org/T138778#2684858 (dduvall)
[20:16:06] I am now running the alter on db1091
[20:16:17] after depooling it
[20:21:36] Blocked-on-schema-change, DBA: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090#2685858 (jcrespo) So this change has been deployed to all hosts, but when I was finishing it, I realized I missed the imagelinks reorder. No time has been wasted, but I have to recreate that table...
[20:24:10] Blocked-on-schema-change, DBA, Wikimedia-Site-requests, Wikisource, and 2 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#2685860 (jcrespo) Because T139090 needs more work and other more urgent changes (plus a week of deployment freeze) I have to delay t...
[21:37:22] DBA, Labs, Labs-Infrastructure, Upstream: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2686158 (jcrespo) I would say T146121 mitigates this issue by a lot. Maybe closing it as resolved or stalled, unless we expect a proper fix from upstream?
[22:45:30] db1091 done
[22:45:36] and repooled
[23:48:09] DBA, Collaboration-Team-Triage, Community-Tech-Tool-Labs, Flow, and 5 others: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2686484 (Mattflaschen-WMF) We should either: * Update wikitech-static scripts immediat...