[05:29:35] DBA, Operations, ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T174857#3583074 (Marostegui) Open→Resolved a: Cmjohnson This is all good now, thanks a lot Chris!
```
root@db1059:~# megacli -LDPDInfo -aAll
Adapter #0
Number of Virtual Disks: 1
Virtual Drive: 0 (Ta...
```
[06:38:09] DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583165 (Marostegui) p: Triage→Normal
[07:21:17] marostegui: hey, is it normal? https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&from=now-1h&to=now
[07:21:24] checking
[07:21:39] i was just checking that actually
[07:21:41] db1095
[07:21:46] yep
[07:21:48] I was on that host right now
[07:22:00] it is not "production", it is a sanitarium host, which is used to filter data to labs
[07:24:19] thanks. I was wondering if I needed to disable the cronjob
[07:25:12] I think it is coming from a massive truncate I am doing on s3 (which db1095 also replicates), which might be too heavy for it to cope with without more throttling
[07:25:15] I stopped it and the lag is now decreasing
[07:25:17] so it looks related
[07:28:57] it is back to 0, I am going to add more throttling
[08:16:40] DBA, Operations, Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3583335 (jcrespo) Not yet, this is still in use.
[08:28:34] DBA: run pt-table-checksum on s5 - https://phabricator.wikimedia.org/T161294#3583345 (Marostegui) I will migrate db1100 to file-per-table once the copy is done. db1049 was not using it.
[08:36:25] wait, if T150306 is to be done
[08:36:25] T150306: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306
[08:36:35] how come we are still writing to it?
[08:36:52] ?
[08:37:02] or is the truncate what was replicated?
[08:37:03] what do you mean we are still writing?
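The throttling approach mentioned at [07:28:57] - deleting in small batches and backing off whenever the sanitarium replica lags - can be sketched roughly as below. This is a minimal illustration, not the actual script; `run_batch` and `get_lag` are hypothetical stand-ins for the real MySQL calls (a `DELETE ... LIMIT n` and a `SHOW SLAVE STATUS` lag check).

```python
import time

def throttled_delete(run_batch, get_lag, max_lag=5, pause=1.0):
    """Delete rows in small batches, sleeping while replica lag is too high.

    run_batch() performs one batched DELETE and returns the number of rows
    removed; get_lag() returns the current replication lag in seconds.
    Both are stand-ins for real database calls.
    """
    total = 0
    while True:
        # Back off until the replica (e.g. db1095) catches up.
        while get_lag() > max_lag:
            time.sleep(pause)
        deleted = run_batch()
        if deleted == 0:      # nothing left to delete
            return total
        total += deleted

# Simulated run: 2500 rows deleted in batches of 1000, lag always zero.
state = {"left": 2500}
def fake_batch():
    n = min(1000, state["left"])
    state["left"] -= n
    return n

print(throttled_delete(fake_batch, lambda: 0))  # -> 2500
```

The key design point is that the lag check happens before every batch, so a struggling replica pauses the whole job rather than just slowing it.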
[08:37:06] the truncate
[08:37:08] ah
[08:37:10] ok
[08:37:25] I thought it was rows being written to mediawiki
[08:37:33] *from
[08:37:35] ah no no :)
[08:38:16] is that table filtered on labs? does it need special attention there?
[08:39:29] and a second question: did it only fail on bawiktionary, but work on other wikis on dbstore1002 with no problem?
[08:39:38] no special attention on labs
[08:39:59] regarding bawiktionary, only that one, but it would have failed on 10 more, which I have created and will later drop, to avoid breaking replication
[08:40:44] interesting
[09:03:02] I am going to merge gerrit:362217
[09:03:16] but I will disable puppet on all important hosts
[09:03:46] i see
[09:03:50] go for it
[09:08:03] it should be safe, but I am not going to risk a network outage
[09:08:16] hehe yeah
[09:10:08] DNS query for '10.64.0.15' failed: NXDOMAIN
[09:10:23] from where?
[09:10:29] because neodymium resolves it
[09:10:32] db1066
[09:10:48] apparently db1066 doesn't
[09:11:05] is it treating that as a DNS name?
[09:11:08] root@db1066:~# host 10.64.0.15
[09:11:08] 15.0.64.10.in-addr.arpa domain name pointer db1011.eqiad.wmnet.
[09:11:08] root@db1066:~# host db1011.eqiad.wmnet.
[09:11:08] db1011.eqiad.wmnet has address 10.64.0.15
[09:11:09] ?
[09:11:14] and it is trying to direct-resolve an IP?
[09:12:33] s/@resolve(($MYSQL_ROOT_CLIENTS))/$MYSQL_ROOT_CLIENTS/ I guess?
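The substitution suggested at [09:12:33] reflects the underlying issue: a resolver helper like ferm's `@resolve()` should only be handed hostnames, while a literal IP such as 10.64.0.15 should pass through untouched instead of triggering the NXDOMAIN lookup seen above. The distinction can be sketched with Python's stdlib `ipaddress`; the function name here is made up for illustration:

```python
import ipaddress

def resolvable_targets(entries):
    """Split client entries into literal IPs (use as-is) and hostnames
    (which need a DNS lookup). Mirrors the idea that @resolve() should
    never be handed an address that is already an IP."""
    ips, names = [], []
    for entry in entries:
        try:
            ipaddress.ip_address(entry)  # raises ValueError for hostnames
            ips.append(entry)
        except ValueError:
            names.append(entry)
    return ips, names

ips, names = resolvable_targets(["10.64.0.15", "db1011.eqiad.wmnet"])
print(ips)    # ['10.64.0.15']
print(names)  # ['db1011.eqiad.wmnet']
```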
[09:13:35] let's try, yes
[09:13:38] it is strange
[09:15:53] DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583460 (Marostegui) Dropped from all the shards except s3, which is being done slowly now
[09:24:11] it is working and I do not see any network drop on db1066, so I am enabling puppet everywhere
[09:24:23] \o/
[09:24:55] DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3583486 (Marostegui) a: Marostegui
[09:33:26] hello people, for https://phabricator.wikimedia.org/T174815 I'd like to increase the batch size of eventlogging_sync.sh from 1000 to 5k/10k, anything against it?
[09:34:18] how would that affect dbstore1002's disk issues?
[09:35:18] elukey: if you hit the max packet or statement size, you could run into problems
[09:36:14] marostegui: it wouldn't, it will only increase the maximum number of inserts done at once, IIUC
[09:36:44] jynus: super ignorant about this part, can you explain that to me a bit more?
[09:37:18] there is a max limit on the amount of query text and results that you can send per query
[09:37:35] elukey: good, same rate but different batch size. right!
[09:37:37] be aware that if you surpass it, the query will fail
[09:37:59] jynus: ahh okok
[09:38:23] https://dev.mysql.com/doc/refman/5.7/en/packet-too-large.html
[09:38:59] all right, will carefully watch logs
[09:41:43] I am not saying it is going to happen
[09:41:55] but a) it is the typical issue when making things larger
[09:42:08] b) I think there were some issues in the past on analytics hosts
[09:42:23] but I do not remember the specifics
[09:42:37] yep, got it, it is a good heads-up, I was looking for this kind of advice
[09:43:01] there is a "backfill" script from db1046 to the replicas
[09:43:23] running that and seeing that no row has to be backfilled would be a good indication that no row is being lost
[09:44:46] * elukey learns about a new script
[09:45:09] those hosts are a neverending source of surprises
[09:45:34] I can imagine all the "fun" discovering them when nothing was puppetized :D
[09:46:05] well, I actually did not change much there because it was a minefield
[09:46:17] I just puppetized what was already running there
[09:46:30] even knowing it was far from perfect
[09:47:19] I wonder if the script would still be necessary if we upgraded to 10.1 and enabled (with the help of small app fixes) parallel replication
[09:47:51] although I think TokuDB has issues with parallel replication
[09:49:44] I am about to enable puppet on all hosts again
[10:00:17] ok to restart apache on dbmonitor for a security update?
[10:00:45] yes
[10:01:02] k, going ahead
[10:01:38] I think it is going to be ok every time, unless we are in the middle of a master failover or similar maintenance
[10:01:54] during which monitoring is very important
[10:02:29] or we are upgrading tendril or something like that
[10:04:07] s5 replication lag didn't recover in time on dbstore1001 for the following backup cycle
[13:23:16] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3584265 (jcrespo) db1053 done, db1064 next.
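The max_allowed_packet concern discussed above turns into a rough sizing rule: a multi-row INSERT must fit in one client/server packet, so the safe batch size is bounded by the packet limit divided by the average encoded row size. A back-of-the-envelope sketch, where the 16 MiB default and the fixed statement overhead are illustrative values (check the server's real setting with `SHOW VARIABLES LIKE 'max_allowed_packet'`):

```python
def rows_per_insert(avg_row_bytes, max_allowed_packet=16 * 1024 * 1024,
                    statement_overhead=1024):
    """Rough upper bound on the number of rows one multi-row INSERT can
    carry before the statement risks exceeding max_allowed_packet.
    avg_row_bytes is the average size of one encoded VALUES tuple."""
    assert avg_row_bytes > 0
    usable = max_allowed_packet - statement_overhead
    return max(1, usable // avg_row_bytes)

# e.g. ~1 KiB rows against the 16 MiB default packet limit
print(rows_per_insert(1024))  # -> 16383
```

By this estimate, jumping the eventlogging_sync.sh batch from 1000 to 5k/10k rows only becomes risky once average rows approach a few kilobytes, which is why watching for "packet too large" errors in the logs is the sensible follow-up.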
[13:24:44] \o/
[14:30:47] DBA, Discovery, GeoData, Maps-Sprint: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#2566073 (TheDJ) The coordinates parser function is within the scope of the GeoData extension.
[15:01:04] Blocked-on-schema-change, Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584584 (Huji)
[15:36:46] DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3584710 (Marostegui) Truncation still happening on s3. I have throttled it quite a bit, because db1095 (sanitarium) was struggling to replicate without any delay.
[15:45:52] Blocked-on-schema-change, Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584564 (Marostegui) Hi, Is this meant to be a database schema change? https://wikitech.wikimedia.org/wiki/Schema_changes If it is: can...
[15:51:17] Blocked-on-schema-change, Security-Reviews: Identify the source of WHOIS data, the retrieval method, and update frequency - https://phabricator.wikimedia.org/T175160#3584785 (Huji) For IP subnet information, we could parse the data directly from RIRs. The data is available for free (in both senses of the...
[18:35:05] marostegui: jynus It'd be nice to go over some extension tables to see what updates they need... PK, unsigned, etc.
[18:35:21] I might file a meta task for looking over them
[18:37:01] Don't know how many tables without a PK we still have - not many, I would say
[18:37:33] Or am I being naive?
[18:37:45] When you're doing those checks... you're actually checking the extension tables too?
[18:38:15] I haven't seen many tasks for extensions..
And I can't believe they're perfect based on how core is :P
[18:38:19] Reedy: Don't really remember, as we started doing it like 7 months ago XD
[18:38:32] No no, they probably need lots of fixing I guess
[18:38:40] Reedy: I am being conservative - let's fix core first
[18:38:45] the rest can wait
[18:38:46] heh
[18:38:54] But I don't know if we'd have time for it now
[18:39:02] especially if we hit not-very-well-maintained ones
[18:39:20] I am not saying it shouldn't be done
[18:39:31] I am just being realistic about my expectations
[18:39:35] of course, that'd be weirdly hypocritical
[18:39:36] haha
[18:40:19] the idea is, once I added the "policy"
[18:40:39] that anything either "officially sanctioned" or deployed to "wmf production"
[18:40:49] should always have a PK
[18:40:58] people have been slowly doing it
[18:41:46] in some cases, when people asked for a schema change, I "convinced" them into adding them at the same time
[18:41:59] * marostegui loves that
[18:42:10] haha
[18:42:18] That's the perfect time to do it
[18:42:20] so for the largest/most important ones (think wikibase, centralauth, flow, translation)
[18:42:28] A bit of CR time now saves potentially a lot of time later for yourselves
[18:42:28] they are fixed or very close
[18:42:32] yep
[18:42:42] that is why I went "political" first
[18:42:44] heh
[18:42:46] You've also got the
[18:42:54] "If you don't add it, this isn't getting deployed"
[18:42:58] It's a good way to get people to do stuff
[18:43:00] XDDDDD
[18:43:03] so I have a useless, but seemingly respected, "policy"
[18:43:15] Especially when it's not an unreasonable request
[18:43:20] useless in the sense that I said "the policy says X"
[18:43:35] and it has a magic effect
[18:43:39] It is soooo painful to add a PK to a table that is already being used without one... pff
[18:43:57] even if it was just added by me on the wiki and "discussed"
[18:44:22] and nobody was going to object to "not adding PKs"
[18:45:15] the problem is not only that,
maintenance without a PK is almost impossible
[18:45:26] and in most cases it is a design error, too
[18:45:37] for example, the query cache one
[18:46:05] I am not saying that a PK would fix that, but if it had been well designed, it wouldn't be the mess that it is now
[18:46:33] And even if it is not the best PK, by having one, you can easily change it :)
[18:46:33] heh
[18:46:35] Reedy: you will be happy to now that I have compared almost every row of every image table on every server
[18:46:38] *know
[18:46:48] and finally, for the first time in 10 years
[18:46:56] we have the same data on all hosts!
[18:47:09] jynus: do you know all the images in commonswiki by memory now?
[18:47:34] not the images, but the chunks and the "events"
[18:47:42] the 2005 bug
[18:47:47] same data on all hosts is overrated
[18:47:47] the 2014 bug
[18:47:52] and the 2015 bug
[18:47:59] what was that?
[18:48:02] on moving/deleting images
[18:48:14] discrepancies cluster around certain dates
[18:48:17] Reedy: sure, no more random page feature needed!
[18:48:32] and in some cases, I can pinpoint specific code deploys/bugs
[18:48:40] marostegui: our not so random random pages...
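Auditing which tables still lack a primary key, as discussed above, boils down to checking each table's index list for a PRIMARY entry (obtainable from `SHOW INDEX FROM <table>` or information_schema). A toy sketch of that check, with made-up table data for illustration:

```python
def tables_without_pk(table_indexes):
    """Given a mapping {table_name: [index names]} (e.g. parsed from
    SHOW INDEX output, where the primary key index is always named
    'PRIMARY' in MySQL/MariaDB), list the tables lacking a primary key."""
    return sorted(
        table for table, indexes in table_indexes.items()
        if "PRIMARY" not in indexes
    )

# Hypothetical sample data, not real schema contents:
print(tables_without_pk({
    "revision": ["PRIMARY", "rev_page_id"],
    "querycache": ["qc_type"],  # the kind of offender mentioned above
}))  # -> ['querycache']
```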
[18:48:57] the funny thing is that in most cases, there was no data loss
[18:49:07] but there were not-deleted redirects
[18:49:17] and same rows with different ids
[19:06:22] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3585986 (Umherirrender)
[19:30:25] brad just merged my patch for making various columns unsigned
[19:30:26] https://gerrit.wikimedia.org/r/350437
[19:35:40] cool
[19:39:04] Blocked-on-schema-change, DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#3586122 (Reedy) https://gerrit.wikimedia.org/r/#/c/350437/ is merged, so good to go whenever you are :)
[19:39:44] Blocked-on-schema-change, DBA: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737#3586140 (Reedy)
[19:46:07] DBA: Run pt-table-checksum on s4 (commonswiki) - https://phabricator.wikimedia.org/T162593#3586159 (jcrespo) I am quite confident about the image table now (I didn't check every host, but codfw for example seems to be in a much better state), checking filearchive next.
[20:47:56] DBA, Analytics, Contributors-Analysis, Chinese-Sites, Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586457 (Nuria) Looks like it is all good: select wiki_db, count(*) from wmf.mediawiki_history where snapshot = "2017-08" and wiki_d...
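The cross-host comparison described above ("same data on all hosts") is what pt-table-checksum automates: checksum each chunk of rows on the master and compare the result against every replica. A toy, order-insensitive version of the idea (the real tool computes CRC32/BIT_XOR aggregates server-side per chunk; this sketch just digests whole small result sets client-side):

```python
import hashlib

def table_digest(rows):
    """Order-insensitive digest of a result set. If two hosts return the
    same rows (in any order), their digests match; any drifted row
    changes the digest."""
    h = hashlib.sha256()
    for row in sorted(repr(r) for r in rows):
        h.update(row.encode())
    return h.hexdigest()

# Hypothetical result sets, standing in for image-table rows on two hosts:
master  = [(1, "File:A.png"), (2, "File:B.png")]
replica = [(2, "File:B.png"), (1, "File:A.png")]   # same data, reordered
drifted = [(1, "File:A.png"), (2, "File:B_old.png")]

print(table_digest(master) == table_digest(replica))  # True
print(table_digest(master) == table_digest(drifted))  # False
```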
[20:48:12] DBA, Analytics, Contributors-Analysis, Chinese-Sites, Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586458 (Nuria) Open→Resolved
[21:26:36] DBA, Analytics, Contributors-Analysis, Chinese-Sites, Patch-For-Review: Data Lake edit data missing for many wikis - https://phabricator.wikimedia.org/T165233#3586571 (Neil_P._Quinn_WMF) @Nuria, yes, the queries where I originally discovered this work now. Thank you!
[23:18:31] DBA, Data-Services, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3586971 (Dispenser) From arwiki: [[https://ar.wikipedia.org/wiki/File:%D8%A7%D9%84%D9%84%D9%87_%D8%B9%D8%B2_%D9%88%D8%AC%D9%84.png|File:الله عز وجل.png]]. Deleted in Nov 2015. ```lang=bash $ sql arwiki_p...