[08:33:33] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui) I have been taking a look at these indexes on enwiki, and we have two indexes in production that ar...
[08:52:09] Blocked-on-schema-change, DBA, Schema-change: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342 (Marostegui)
[09:02:49] read_only: "False", expected "True" on staging
[09:03:02] yep
[09:03:04] I am aware
[09:03:20] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491408/
[09:03:24] I started taking a look yesterday
[09:04:42] maybe have it default to 1 so it doesn't need so many changes on unrelated code?
[09:05:03] yeah, but I guess we need to make that default parametrizable?
[09:05:12] ?
[09:05:21] ?
[09:05:29] staging cannot be read_only
[09:05:48] define mariadb::instance ( $read_only = 1,
[09:06:01] then you discard the changes that are not staging
[09:06:35] not sure what you mean: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491408/11/modules/profile/manifests/mariadb/dbstore_multiinstance.pp ?
[09:06:39] e.g. core/multiinstance.pp is untouched
[09:06:53] line 135
[09:06:57] dbstore_multiinstance.pp only has that line
[09:07:05] the others are not touched
[09:07:30] so you mean just committing line 135?
[09:07:44] well, and the define and the template
[09:08:45] ah I think I know what you mean
[09:09:11] :-)
[09:09:23] let me put up a patchset to see if I got you right
[09:09:24] right
[09:13:48] is this kinda what you meant? https://gerrit.wikimedia.org/r/491713
[09:14:56] yes, much simpler
[09:15:03] test it on the compiler though
[09:15:06] yeah
[09:15:08] doing it now
[09:15:17] and indeed, a lot simpler, thanks for the advice
[09:22:37] The compiler looks good! \o/
[09:22:41] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491713/
[09:42:51] ha!
[09:43:12] The problem is also that /etc/nagios/nrpe.d/check_mariadb_slave_sql_state_staging.cfg has --check_read_only=1 hardcoded
[09:45:56] ?
[09:46:34] from what I am seeing $port on modules/mariadb/manifests/monitor_readonly.pp is being done correctly, but not the $read_only
[09:46:51] check_mariadb_slave_sql_state doesn't check read only
[09:47:05] sorry: check_mariadb_read_only_staging.cfg
[09:47:12] copy&paste error :)
[09:47:30] did you run puppet there? it has to run first on the host and then on icinga
[09:47:48] I did
[09:47:55] (also on icinga)
[09:48:03] I know it is not hardcoded because we check that it is 0 on the core masters
[09:48:11] let me re-schedule the check on icinga
[09:48:24] or something else is missing
[09:49:08] it is hardcoded, or at least check_mariadb_read_only_staging.cfg didn't change it to 0
[09:49:14] oh, you forgot to change it on puppet
[09:49:29] read_only => 1,
[09:49:31] Did I?
[09:49:42] it should use the variable
[09:49:51] but I wonder why $port is being used correctly
[09:50:01] port says $port
[09:50:07] read_only says 1
[09:50:13] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491713/4/modules/mariadb/manifests/instance.pp
[09:50:18] ^line 71
[09:50:28] aaaah
[09:50:29] you reverted too much :-D
[09:50:35] I was checking modules/mariadb/manifests/monitor_readonly.pp
[09:50:38] haha indeed
[09:54:52] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491720/ now it looks good!
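For context on the flag being parametrized above, here is a minimal Python sketch, in the spirit of the check_mariadb_read_only_* NRPE checks, of a read-only check that takes the expected value as a parameter (0 on masters and staging, 1 on replicas) instead of hardcoding it. It is not the production check script: the --check_read_only spelling matches the cfg line quoted above, but the connection defaults, query and exit codes are assumptions.

    #!/usr/bin/env python3
    # Hypothetical sketch of an NRPE-style read_only check; not the production script.
    import argparse
    import sys

    import pymysql  # assumption: PyMySQL is available on the monitored host


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--host', default='localhost')
        parser.add_argument('--port', type=int, default=3306)
        parser.add_argument('--check_read_only', type=int, choices=[0, 1], required=True,
                            help='expected value of @@GLOBAL.read_only')
        args = parser.parse_args()

        try:
            conn = pymysql.connect(host=args.host, port=args.port,
                                   read_default_file='/root/.my.cnf')  # assumed credentials file
            with conn.cursor() as cursor:
                cursor.execute('SELECT @@GLOBAL.read_only')
                actual = int(cursor.fetchone()[0])
        except Exception as error:
            print(f'UNKNOWN: could not query read_only: {error}')
            sys.exit(3)

        if actual == args.check_read_only:
            print(f'OK: read_only is {actual}, as expected')
            sys.exit(0)
        print(f'CRITICAL: read_only is {actual}, expected {args.check_read_only}')
        sys.exit(2)


    if __name__ == '__main__':
        main()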
[09:55:55] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui)
[09:57:05] es backups finished
[09:57:15] thanks for the help with the read-only btw
[09:57:27] a lot simpler now
[09:57:52] how long did it take?
[09:59:00] ?
[09:59:09] the es backup
[09:59:10] what do you mean?
[09:59:23] oh, sorry, got thrown out of context
[09:59:27] haha
[09:59:38] I was like, but you did that!
[10:00:08] I think you can sort-of see it here https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-2d&to=now&var-server=es2003&var-datasource=codfw%20prometheus%2Fops&var-cluster=mysql
[10:00:08] Actually, you got me out of a loop where I was getting deeper and deeper into the change and not seeing simple things :-)
[10:00:33] so your change was maybe better, but my suggestion was simpler, which minimizes risk
[10:00:49] ah nice, so the backup took around 12h
[10:02:41] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui) db1067 (s1 master) has too much concurrency to let the alter go thru, I will try a few more times before givin...
[10:02:50] other backups seem mostly done except s8 on codfw
[10:03:34] no issues with the path then?
[10:04:24] none so far, I discovered a bug on gathering stats but I fixed it in time on Tuesday
[10:04:45] as it used to be that all backups were dumps and now there are dumps and snapshots
[10:04:47] cool
[10:04:51] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui) s3 eqiad [] labsdb1011 [] labsdb1010 [] labsdb1009 [] dbstore1004 [] dbstore1002 [] db1124 [] db1123 [] db109...
[10:04:56] ah right!
[10:05:11] Blocked-on-schema-change, MediaWiki-Change-tagging, Patch-For-Review, User-Ladsgroup: Drop change_tag.ct_tag column in production - https://phabricator.wikimedia.org/T210713 (Marostegui)
[10:05:33] should we have a specific way to differentiate those on the backups table?
[10:05:41] it is
[10:05:52] ah I see it now
[10:06:01] | 619 | snapshot.s1-test.2019-01-22--11-39-08 | ongoing | dbstore1002.eqiad.wmnet:3311 | dbstore1001.eqiad.wmnet | snapshot | s1-test | 2019-01-22 11:39:08 | NULL | NULL |
[10:06:38] s7 is also ongoing
[10:08:06] I think now I am going to finish the pieces to allow post-processing and data-gathering on a binary backup
[10:26:13] if you need help with db1067, I can have a look
[10:26:49] don't worry, I will try a few more times, it basically has too much concurrency
[10:49:56] 2 correctable memory errors on x1-master (to keep an eye on)
[10:51:01] https://phabricator.wikimedia.org/T201133
[10:51:23] I see
[10:57:03] DBA, Operations, ops-eqiad: db1069 (x1 master) memory errors - https://phabricator.wikimedia.org/T201133 (Marostegui) Just for the record ` db1069 Memory correctable errors -EDAC- WARNING 2019-02-20 10:45:24 2d 19h 28m 54s 3/3 2 ge 2 `
[11:54:26] DBA, Wikimedia-Site-requests, Serbian-Sites, Wikimedia-maintenance-script-run: Mass bigdeletion scheduled for sr.wikinews - https://phabricator.wikimedia.org/T212346 (Zoranzoki21) >>! In T212346#4957919, @MarcoAurelio wrote: > Could I have my botflag at sr.wikinews removed, please? Until T122705...
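The row pasted above shows that the backups metadata table already differentiates the two kinds of backup via a type column (dump vs snapshot). Here is a small Python sketch of how ongoing backups could be listed per type; the database, table and column names below are guesses based on that pasted row, not the real schema used by the backup tooling.

    #!/usr/bin/env python3
    # Hypothetical sketch: list ongoing backups grouped by type (dump vs snapshot).
    import pymysql  # assumption: the metadata lives in a MariaDB database reachable with PyMySQL

    # 'dbbackups' and the column names are guesses inferred from the row pasted above
    conn = pymysql.connect(read_default_file='/root/.my.cnf', database='dbbackups')
    with conn.cursor() as cursor:
        cursor.execute(
            """SELECT type, section, name, start_date
                 FROM backups
                WHERE status = 'ongoing'
                ORDER BY type, section"""
        )
        for backup_type, section, name, start_date in cursor.fetchall():
            print(f'{backup_type:10} {section:10} {name} (started {start_date})')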
[15:31:59] DBA, Wikimedia-Site-requests, Patch-For-Review, User-MarcoAurelio: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107 (MarcoAurelio) >>! In T215107#4963000, @Wilfredor wrote: > How much it could take? Hello. I am not sure. There's a prob...
[16:23:32] question- I ask to "prepare a backup" (xtrabackup --prepare, compress, gather statistics) but there happen to be 2 ongoing backups with the same section and type, what would you expect to see?
[16:23:43] a) an error
[16:23:58] b) assume we want to prepare the latest, or the oldest?
[16:25:42] normally I would opt for a, but note that b is not an unhandled error, because the non-chosen one would be kept as ongoing, and we will check that the backup finished correctly anyway
[16:37:53] what would happen if b kicks in? would it overwrite the existing backup that is being processed?
[16:38:46] this may need some context
[16:39:02] the idea is to have a new option (I am working on it right now)
[16:39:29] dump_section.py s3 --only-postprocess
[16:40:13] which moves things from ongoing to latest, gathers stats, and compresses/prepares it if it is a snapshot
[16:40:53] however, what if there are 2 dump.s3. directories?
[16:42:03] alternatively, I can do something like dump_section.py dump.s3. --only-postprocess
[16:42:43] but I am too deep in devel so I cannot decide what makes more sense
[16:43:04] or I can just give an error "there are several dumps, aborting"
[16:48:10] so yes, it can "break" the backup if it is ongoing (in reality no, because it won't pass, as it will lack metadata/snapshot files)
[16:48:22] but that is true if it is run on a single file
[16:48:31] I think I will generate an error
[16:49:00] after all, you can also try to prepare a non-existing backup
[16:50:16] let me read back
[16:51:09] yeah, I prefer to generate an error rather than messing up (or potentially messing up) an ongoing backup
[17:09:17] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Anomie) Is there any way to find which queries are using them? The `(16)` prefix on those strikes me as particu...
[17:14:32] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (jcrespo) >>! In T51199#4969053, @Anomie wrote: > Is there any way to find which queries are using them? It sho...
[17:23:08] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar)
[17:24:08] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Broken storage?: ` Feb 18 13:24:54 mysqld[837]: InnoDB: Error number 5 means 'Input/output error'. `
[17:24:34] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) For #DBA , I will poke you on Thursday for some assistance. Not sure whether the Innodb database can be recovered. dep...
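A minimal Python sketch of option (a) as settled above: when asked to --only-postprocess and more than one ongoing backup matches the requested type and section, abort with an error instead of guessing. The directory layout, naming pattern and function name are assumptions for illustration, not the actual dump_section.py code.

    #!/usr/bin/env python3
    # Hypothetical sketch of the "there are several dumps, aborting" behaviour discussed above.
    import glob
    import os
    import sys

    ONGOING_DIR = '/srv/backups/ongoing'  # assumed location of ongoing backups


    def find_backup_to_postprocess(backup_type, section):
        """Return the single ongoing backup dir for (type, section), or abort."""
        pattern = os.path.join(ONGOING_DIR, f'{backup_type}.{section}.*')
        matches = sorted(glob.glob(pattern))
        if not matches:
            sys.exit(f'ERROR: no ongoing {backup_type} found for section {section}')
        if len(matches) > 1:
            # option (a): several candidates with the same type and section, refuse to pick one
            sys.exit(f'ERROR: there are several {backup_type}s for {section}, aborting: {matches}')
        return matches[0]


    if __name__ == '__main__':
        path = find_backup_to_postprocess('dump', 's3')
        print(f'would post-process (move to latest, gather stats, compress/prepare): {path}')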
[17:24:47] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Marostegui) It is certainly being used for some queries, I can see this counter increasing: ` root@db1089.eqiad...
[17:27:21] the beta cluster master database is dead (InnoDB corruption somehow), that is https://phabricator.wikimedia.org/T216635
[17:27:29] but I think it is too late for us Europeans to look into it
[17:27:29] hashar: check my comment
[17:27:57] maybe the disk / innodb tables are corrupted yeah ;((
[17:28:17] I would suggest looking at its storage, because that error isn't looking great
[17:28:20] Krenair was looking at the slave, maybe we can reuse that to repopulate a master
[17:28:25] yeah :(
[17:28:33] The log sequence numbers 212565209189 and 212565209189 in ibdata files do not match the log sequence number 233682420105 in the ib_logfiles!
[17:29:03] I felt maybe one is out of date and maybe that can be recovered, albeit with some data loss for the last operations
[17:29:03] I would focus on the I/O error though
[17:34:56] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) ` umount /srv fsck /srv mount /srv ` I restarted mysql. Same I/O error. Maybe the disk is corr...
[17:37:00] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Anything on `dmesg`? Can you do a `touch /srv/test`?
[17:51:07] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Ah dmesg! [Mon Feb 18 13:24:48 2019] EXT4-fs (dm-0): warning: mounting fs with errors, running e2fsck is recommended [...
[17:51:18] Blocked-on-schema-change, MediaWiki-Database, MW-1.32-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Schema-change: Add index log_type_action - https://phabricator.wikimedia.org/T51199 (Anomie) I don't doubt that it is being used. But as we've seen elsewhere the planner can sometimes make strange...
[17:54:19] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) I did a fsck again, but I am afraid the partitions are corrupted beyond control :/ head /usr/share/prometheus-node-e...
[20:18:41] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) After I have done the fsck the I/O error is gone. It was: Feb 18 13:24:54 mysqld[837]: 190218 13:24:54 [ERROR] InnoDB:...
[20:19:45] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Addshore) p: Triage→High Marking as High (as I did with the other ticket) as beta is broken until this is fixed afaik.
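Related to the "counter increasing" check mentioned for T51199 at 17:24:47: one way to watch whether an index such as log_type_action is actually being read is to sample index counters over time. The Python sketch below uses MariaDB's userstat counters (INFORMATION_SCHEMA.INDEX_STATISTICS); whether that is the exact counter that was checked on db1089, and whether userstat is enabled there, are assumptions.

    #!/usr/bin/env python3
    # Hypothetical sketch: sample ROWS_READ for an index twice and compare.
    # Requires MariaDB with userstat enabled; the counter choice is an assumption.
    import time

    import pymysql  # assumption: PyMySQL is available


    def rows_read(cursor, schema, table, index):
        cursor.execute(
            """SELECT ROWS_READ
                 FROM INFORMATION_SCHEMA.INDEX_STATISTICS
                WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s AND INDEX_NAME = %s""",
            (schema, table, index))
        row = cursor.fetchone()
        return int(row[0]) if row else 0


    conn = pymysql.connect(read_default_file='/root/.my.cnf')
    with conn.cursor() as cursor:
        before = rows_read(cursor, 'enwiki', 'logging', 'log_type_action')
        time.sleep(60)  # wait a bit and sample again
        after = rows_read(cursor, 'enwiki', 'logging', 'log_type_action')
        print(f'log_type_action ROWS_READ went from {before} to {after}')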
[20:28:01] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Then I am trying: ` name=/etc/my.cnf [mysqld] innodb-force-recovery = 1 ` Then `sudo systemctl start mariadb` which sp...
[20:45:28] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (Marostegui) Data looks very corrupted. At this point the best option is to rebuild that host from the slave.
[20:45:49] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Without innodb-force-recovery = 1, I get the same P8111 by simply moving enwiki/archive files. So that table definitely...
[20:50:09] marostegui: yeah I got a fault when reading some page :(((
[20:54:28] DBA, Beta-Cluster-Infrastructure, Release-Engineering-Team: MySQL database on deployment-db03 does not start due to InnoDB issue - https://phabricator.wikimedia.org/T216635 (hashar) Open→Stalled Not much to do here. The slave deployment-db04 eventually managed to start mysql and new instances...
[20:54:34] so yeah will wait for the slave to recover eventually ;)
[20:55:47] * hashar waves
[22:27:26] DBA, MediaWiki-API: API problem with usercontribs - https://phabricator.wikimedia.org/T216656 (Anomie) **TL;DR:** The thing to do here is probably to stop using the 'contributions' replica group in ApiQueryUserContribs, at least when we're not using `rev_user` to specify the user. The query here is ` la...
[22:29:06] DBA, MediaWiki-API: API problem with usercontribs - https://phabricator.wikimedia.org/T216656 (Anomie) Speaking of the partitioning on the 'contributions' group, I wonder whether we'll start running into similar issues in the other things using that group once we flip the actor migration switch so querie...