[00:01:05] 10DBA, 10Operations, 10Ops-Access-Requests, 10cloud-services-team (Kanban): Access to raw database tables on labsdb* for wmcs-admin users - https://phabricator.wikimedia.org/T178128#3681728 (10bd808)
[02:10:58] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681813 (10Bawolff)
[02:16:06] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681814 (10Bawolff)
[02:28:36] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681815 (10Bawolff)
[02:32:33] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681819 (10Bawolff) So I guess part of the problem with the alerting is that it's looking at individual lag, where the real issue (MediaWiki goes into read-only mode) comes up when all the slaves are...
[05:28:37] 10DBA, 10Data-Services, 10Security-Team: pagetranslation log_type missing on replicas - https://phabricator.wikimedia.org/T178052#3681916 (10Marostegui) Just for the record, those are replicated on the labs hosts, they are just missing the views: ``` mysql:root@localhost [commonswiki]> select @@hostname; +--...
[05:32:32] 10DBA, 10Operations: Wikimedia\Rdbms\DBQueryTimeoutError (not repeated) - https://phabricator.wikimedia.org/T178109#3681065 (10Marostegui) I am not seeing an abnormal amount of errors on eswiki for the last 24 hours. Does this happen every time you try it, or did it happen just one time?
[05:40:45] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681920 (10Marostegui) >>! In T178094#3681819, @Bawolff wrote: > So I guess part of the problem with the alerting is that it's looking at individual lag, where the real issue (MediaWiki goes into read...
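The alerting change discussed in T178094 (page on a section's overall state rather than each replica's individual lag) could be sketched roughly as follows. This is a minimal illustration only; the threshold values, the function name, and the notion of a "healthy" replica are assumptions, not the production monitoring configuration.

```python
# Hedged sketch: instead of alerting on each replica's lag in isolation,
# consider the section read-only-risky only when too few replicas are
# caught up, which is when MediaWiki would actually switch to read-only.
# Both constants below are illustrative, not Wikimedia's real settings.
LAG_THRESHOLD = 6   # seconds of lag after which a replica counts as lagged
MIN_HEALTHY = 1     # replicas that must stay caught up to keep writes on

def section_at_risk(replica_lags):
    """True when so many replicas lag that read-only mode is likely."""
    healthy = sum(1 for lag in replica_lags if lag < LAG_THRESHOLD)
    return healthy < MIN_HEALTHY
```

With a check like this, one slow replica (`[0.5, 12, 30]`) would not page, while a fully lagged section (`[10, 20, 30]`) would.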
[05:48:00] 10DBA, 10Data-Services, 10XTools: Request to increase active connection quota for user s51187 on analytics.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T177570#3681922 (10Marostegui) The user `s53003` had no connection limit on the old servers indeed. I have increased it from 10 to 20 on `.web`....
[05:51:12] 10DBA, 10MediaWiki-extensions-Linter: Purge html5-misnesting tags from the database - https://phabricator.wikimedia.org/T178040#3678602 (10Marostegui) Thanks for the heads up @Legoktm, when will your script be finished? Are you planning to leave it running during the weekend if it doesn't finish today? Thanks!
[05:53:10] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681926 (10Bawolff) >There have also been discussions on Operations on whether we should alert if mediawiki goes into read only, but this can cause many many false positives that it might be pagi...
[05:58:25] 10DBA, 10Commons: High replication lag causing read only mode on commons - https://phabricator.wikimedia.org/T178094#3681927 (10Marostegui) >>! In T178094#3681926, @Bawolff wrote: >>There have also been discussions on Operations on whether we should alert if mediawiki goes into read only, but this can cause man...
[06:10:59] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3681942 (10Marostegui)
[06:30:21] 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3681962 (10Marostegui) I am checksumming db1103 (which has db1035's data) against db1038's data (which is also on db1072, as db1072 was cloned from db1038)
[06:59:56] 10DBA, 10MediaWiki-extensions-Linter: Purge html5-misnesting tags from the database - https://phabricator.wikimedia.org/T178040#3681978 (10Legoktm) 05Open>03Resolved The script finished sometime earlier today, I just didn't check the screen until now.
[07:20:05] 10DBA, 10MediaWiki-extensions-Linter: Purge html5-misnesting tags from the database - https://phabricator.wikimedia.org/T178040#3678602 (10jcrespo) @Legoktm I see your SAL, but I do not see your entry on https://wikitech.wikimedia.org/wiki/Deployments; can you add it?
[07:47:45] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3682044 (10Marostegui)
[07:53:30] stretch-based DB servers have been upgraded to the 4.9.51 kernel package
[07:58:08] manuel, how are you comparing db1038 with the other host?
[07:58:10] 10DBA, 10Operations, 10ops-codfw: db2081 unreachable - https://phabricator.wikimedia.org/T178140#3682064 (10Marostegui)
[07:59:17] mydumper + zdiff
[07:59:26] yeah, but where?
[07:59:36] db1103
[07:59:47] you copied the dump to that host?
[07:59:58] yes, db1038 and db1103 are there now
[07:59:59] or is it WIP?
[08:00:08] can we delete it from 38?
[08:00:17] oh, yes, let me do it
[08:00:40] 50 free GB was too little free space :-)
[08:00:40] doing it now
[08:01:14] done
[08:03:18] mmm, we lost monitoring on db1038?
[08:03:49] it is ok: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&fullscreen&orgId=1&var-server=db1038&var-network=bond0&from=1507878220784&to=1507881820784
[08:03:52] I disabled it because it is depooled and about to be decommissioned:
[08:04:07] https://phabricator.wikimedia.org/T177911#3674729
[08:04:41] cool
[08:04:42] it was downtimed till Monday (when I will get rid of it)
[08:10:00] I am going to restart the commons purge
[08:10:05] cool
[08:21:57] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3682149 (10hoo) The `cawiki` module has been fixed now, but many other wikis still have the problematic code {T178114}...
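The "mydumper + zdiff" comparison mentioned above can be sketched as a small script. The directory layout assumed here (one `*.sql.gz` file per table, under one directory per host, e.g. `dumps/db1038/` and `dumps/db1103/`) is illustrative; it only mimics what comparing two mydumper runs looks like.

```python
import gzip
from pathlib import Path

def diff_dumps(dir_a, dir_b):
    """Return names of per-table dump files that differ between two hosts.

    Hypothetical layout: each directory holds one `<table>.sql.gz` per
    table, as produced by a compressed mydumper run.
    """
    mismatched = []
    for f in sorted(Path(dir_a).glob("*.sql.gz")):
        other = Path(dir_b) / f.name
        if not other.exists():
            mismatched.append(f.name)
            continue
        # decompress both sides and compare contents, as `zdiff` would
        with gzip.open(f, "rb") as a, gzip.open(other, "rb") as b:
            if a.read() != b.read():
                mismatched.append(f.name)
    return mismatched
```

In practice tools like pt-table-checksum (also used in this log) are preferable for live hosts, since a dump-and-diff only compares one point in time.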
[08:22:05] jynus: ^
[08:22:49] But by now I think we might not really come around separating the table out :/
[08:23:44] "problematic code", are we talking about on-wiki issues?
[08:24:05] Indeed… their module did something very inefficient
[08:24:13] which was trivial to fix, so I made them fix it
[08:25:48] yeah, the issue is that they had "apparently" used a lot of statements, right?
[08:26:06] Yeah… they actually accessed them… but for no good reason
[08:26:13] the thing is
[08:26:21] others may not have that
[08:26:38] but may have other very large usages :-/, real or not
[08:26:58] True… but w/o code review we can hardly have reasonable certainty about this not happening again
[08:27:15] they might do something equally big in usage at any time
[08:27:15] well, not on code
[08:27:28] but on templates
[08:27:36] and maybe not now, but in the future
[08:27:43] Yeah… templates and modules can be changed at will
[08:27:53] what if {{CC-BY-SA-4.0}} gets wikidata code?
[08:27:56] on commons
[08:28:23] who says it doesn't have it already, and it is the cause of the RC issues?
[08:28:36] It might actually already have
[08:28:55] I'm working with the people on commons to get their modules/templates reasonable
[08:29:05] this should actually start paying off there soon
[08:29:59] in any case
[08:30:11] can I have a look at the "insertion" code?
[08:30:31] we may not be able to see it, but we could at least slow down the write rate more
[08:30:53] in a worst-case scenario
[08:32:05] https://phabricator.wikimedia.org/diffusion/EWBA/browse/master/client/includes/Usage/Sql/EntityUsageTable.php;88b9a2f81d57b657ed41484185bec260aefb1b55$170
[08:33:15] (I'm fairly sure we never hit this code path during a user web request)
[08:33:22] I know
[08:33:42] well, I don't know, but I would assume
[08:34:48] I'd say to increase the batch size to 1000 and add a wait for slaves after endAtomic
[08:35:30] Ok, but just for the insert, I guess?
[08:35:42] what other things are there?
[08:36:02] oh, the remove
[08:36:23] let me see, was the issue last time with inserts or with removes?
[08:38:27] we had problems with reads there at some point (but these are not supposed to hit the master)
[08:39:07] do you have a timestamp for me?
[08:39:09] to check
[08:39:15] approximately?
[08:39:26] Should have been deployed in early August
[08:39:32] no
[08:39:39] the thing manuel told you about
[08:40:10] a few days ago
[08:41:43] hm… can't remember that :S
[08:42:00] ok, I will check on the logs
[09:09:29] I do not see any reason not to continue
[09:09:49] Ok… but with changed batch size and wait for slaves?
[09:10:25] that should be there anyway, but it is not relevant for this
[09:12:19] Ok, noted
[09:13:13] 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3682248 (10jcrespo) 53156406 rows purged on commons so far of the initial 58M esti...
[09:17:51] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3682285 (10Marostegui)
[09:30:02] marostegui: I am going to depool db1098 for recentchanges defragmenting
[09:30:16] do you want me to run other alters/optimizes there?
[09:30:59] Oh, I just replied on the gerrit patch :)
[09:31:06] he he
[09:31:11] yeah, if you can do pagelinks and templatelinks across frwiki, jawiki and ruwiki
[09:31:14] that'd be great
[09:31:21] just optimize?
[09:31:23] yep
[09:31:26] ok
[09:31:36] and once done, let me know (or update: T174509)
[09:31:36] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[09:31:38] thank you :)
[09:31:47] I will update it directly
[09:31:52] \o/
[09:31:52] thanks
[09:32:05] let me amend it to add the ticket
[09:32:08] so I will not forget
[09:33:04] how long does it take for you, if done in parallel?
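The suggestion above (raise the batch size to 1000 and wait for slaves after each atomic section) is a generic write-throttling pattern. The sketch below is not Wikibase's actual `EntityUsageTable` code; the hook names `insert_batch` and `wait_for_replication` are invented stand-ins for the real MediaWiki primitives.

```python
# Hedged sketch of the batching-plus-replication-wait pattern discussed
# above. Writing in fixed-size batches and pausing until replicas catch
# up keeps a bulk insert/remove from building up replication lag.
BATCH_SIZE = 1000  # the value proposed in the discussion

def write_in_batches(rows, insert_batch, wait_for_replication,
                     batch_size=BATCH_SIZE):
    """insert_batch and wait_for_replication are caller-supplied hooks."""
    for i in range(0, len(rows), batch_size):
        insert_batch(rows[i:i + batch_size])  # one atomic section per batch
        wait_for_replication()                # throttle between batches
```

For 2500 rows this performs three batch writes (1000, 1000, 500 rows) with a replication wait after each one.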
[09:33:08] on a large server
[09:33:17] (approx)
[09:33:30] mmm, around 4-5 hours I think
[09:33:35] but I have done so many shards...
[09:33:37] that I am not sure :)
[09:34:47] we'll see how it reacts to the depool
[09:47:33] there is an inconsistency between the number of rows in prod and labs in the ores_classification table in wikidatawiki
[09:47:46] labs says 16M rows, prod says 46M
[09:48:24] how do you find out about the number of rows?
[09:48:38] select * from information_schema.TABLES where table_name = 'ores_classification'\G
[09:48:42] in prod
[09:49:00] in labs, select count(*) from ores_classification;
[09:49:01] ok, so basically, show table status
[09:49:15] one of those you mention is inaccurate
[09:49:25] :-)
[09:49:30] that is by design
[09:50:23] select count(*) from ores_classification; in prod also says 46M
[09:50:26] it's just slower
[09:50:43] exactly
[09:51:25] one is slow but accurate, the other is fast ("cached") but inaccurate
[09:51:49] I see, okay
[09:51:52] Thanks
[09:51:55] it can be updated with an analyze table, but that creates slowdown/replag
[09:52:05] so we only do it if needed
[09:52:31] note there is no perfect count(*)
[09:52:43] we probably need to shrink it later (once it's done cleaning up); it'll free up some space for you
[09:52:48] different users can see different rows at any time
[10:34:11] the old dumps seem stuck, or at least still running
[10:34:39] it could be the reboot
[11:24:57] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3682593 (10Marostegui)
[12:00:51] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682643 (10Marostegui) p:05Triage>03Normal
[12:01:43] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682623 (10Marostegui) a:03Cmjohnson @Cmjohnson can we get this disk replaced if we have spares?
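The slow-but-accurate versus fast-but-cached distinction explained above (`SELECT COUNT(*)` versus the estimate in `information_schema.TABLES` / `SHOW TABLE STATUS`, refreshable via `ANALYZE TABLE`) can be illustrated with a toy model. The `Table` class below is purely hypothetical; it only mimics the behaviour being described, not MySQL internals.

```python
# Toy illustration of the trade-off from the discussion above: an O(1)
# cached row estimate (like SHOW TABLE STATUS) versus an exact but
# full-scan count (like SELECT COUNT(*)).
class Table:
    def __init__(self, rows):
        self.rows = list(rows)
        self.estimated_rows = len(self.rows)  # refreshed only by analyze()

    def insert_many(self, new_rows):
        self.rows.extend(new_rows)            # the estimate goes stale

    def exact_count(self):
        return sum(1 for _ in self.rows)      # slow but accurate full scan

    def analyze(self):                        # like ANALYZE TABLE
        self.estimated_rows = len(self.rows)
```

After inserting rows, `estimated_rows` still reports the old figure until `analyze()` runs, while `exact_count()` always scans and returns the true total, which matches the 16M-vs-46M discrepancy seen on the replicas.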
[12:03:54] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682658 (10Marostegui) We could get disks from db1049, which are the same size, and db1049 is ready to be decommissioned (T175264)
[12:06:32] 10DBA: Run pt-table-checksum on s3 - https://phabricator.wikimedia.org/T164488#3682674 (10Marostegui) Currently fixing inconsistencies on db1072 and db1038 (even though it will be decommissioned, it takes just a few commands to get that one fixed too) and checking all the rest of the hosts for the values that ar...
[12:42:13] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682623 (10jcrespo) @Marostegui shouldn't we just schedule it for decom? The data from the original master was copied to db1098, and we already ran checksum on these hosts. Plus it is depooled already.
[12:43:08] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682909 (10Marostegui) >>! In T178157#3682903, @jcrespo wrote: > @Marostegui shouldn't we just schedule it for decom? The data from the original master was copied to db1098, and we already ran checksum on th...
[12:44:41] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682918 (10jcrespo) > I didn't know its data was copied over to db1098! It wasn't; an older, and potentially more accurate/more different old master was; but it doesn't matter, it was fully checked on T160509.
[12:45:16] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1050 - https://phabricator.wikimedia.org/T178157#3682921 (10Marostegui) 05Open>03declined Host to be decommissioned
[12:45:38] or we can just reconvert it to decom :-)
[12:45:46] 10DBA: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3682923 (10Marostegui)
[12:45:46] haha ^
[12:45:53] that's ok
[12:46:02] 10DBA: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3682935 (10Marostegui) p:05Triage>03Normal
[12:46:14] 10DBA: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3682923 (10Marostegui)
[12:46:16] 10DBA, 10Operations, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3682937 (10Marostegui)
[15:43:55] 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Patch-For-Review, 10Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3683399 (10jcrespo) 05Open>03Resolved There are more thi...
[15:44:05] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3683401 (10jcrespo)
[15:48:38] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3683407 (10jcrespo) The initial scope- query issues on ruwiki...
[15:49:25] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3683411 (10jcrespo) a:05jcrespo>03None
[15:52:44] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3683421 (10jcrespo) For a separate ticket, other potential pr...
[15:53:35] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3683423 (10jcrespo) p:05Unbreak!>03Normal Lowering priori...
[16:00:09] 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Patch-For-Review, 10Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3683438 (10jcrespo) For the curious, 150GB of disk space (an...
[16:01:58] 10DBA, 10Data-Services, 10XTools: Request to increase active connection quota for user s51187 on analytics.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T177570#3683443 (10MusikAnimal) Thanks! I should also mention our query killer apparently wasn't working... pretty sure that was partly at fault...
[16:07:25] 10DBA, 10MediaWiki-Watchlist, 10Wikidata, 10Patch-For-Review, 10Russian-Sites: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3683463 (10Marostegui) >>! In T177772#3683438, @jcrespo wrot...
[16:09:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3683468 (10jcrespo) If you still want to do recentchanges at the same time, it only takes between 18 seconds and 18 minutes; but I can create my own t...
[17:00:04] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3683631 (10Marostegui) >>! In T174509#3683468, @jcrespo wrote: > If you still want to do recentchanges at the same time, it only takes between 18 seco...
[17:12:03] 10DBA, 10Data-Services, 10XTools: Request to increase active connection quota for user s51187 on analytics.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T177570#3683698 (10Marostegui) Let's leave this open for the weekend and we can close it on Monday or Tuesday if all is fine, sounds good?
[17:12:40] 10DBA, 10Data-Services, 10XTools: Request to increase active connection quota for user s51187 on analytics.db.svc.eqiad.wmflabs - https://phabricator.wikimedia.org/T177570#3683700 (10MusikAnimal) >>! In T177570#3683698, @Marostegui wrote: > Let's leave this open for the weekend and we can close it on Monday...
[19:10:50] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3684021 (10Jwh) Works fine now. Thank you so much to have sol...
[20:18:39] 10Blocked-on-schema-change, 10DBA, 10Data-Services, 10Dumps-Generation, 10MediaWiki-Platform-Team (MWPT-Q1-Jul-Sep-2017): Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3684216 (10CCicalese_WMF)
[20:30:15] 10DBA, 10Community-Tech, 10Data-Services, 10Security, 10cloud-services-team (Kanban): Create core ip_changes view for replicas - https://phabricator.wikimedia.org/T173891#3684240 (10bd808) This should be ready for maintain-views to be run on labsdb* to actually create the view in each wikidb. I //think//...
[20:31:51] 10DBA, 10Community-Tech, 10Data-Services, 10Security, 10cloud-services-team (Kanban): Create core ip_changes view for replicas - https://phabricator.wikimedia.org/T173891#3543452 (10bd808) p:05Triage>03Normal
[20:32:27] 10DBA, 10Structured-Data-Commons, 10Wikidata, 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018): Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3684245 (10CCicalese_WMF)
[20:32:59] 10DBA, 10wikitech.wikimedia.org: Rename database labswiki to wikitech - https://phabricator.wikimedia.org/T171570#3684246 (10bd808)
[22:06:22] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10MediaWiki-Platform-Team (MWPT-Q1-Jul-Sep-2017): WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3684516 (10CCicalese_WMF)
[22:39:56] 10DBA, 10monitoring, 10Epic, 10Patch-For-Review, 10Wikimedia-Incident: Reduce false positives on database pages - https://phabricator.wikimedia.org/T177782#3669845 (10Dzahn) The change I uploaded above intends to disable paging for the specific check "mysql procs running" if a host is in labs/labtest. I...
[22:41:16] 10DBA, 10monitoring, 10Epic, 10Patch-For-Review, 10Wikimedia-Incident: Reduce false positives on database pages - https://phabricator.wikimedia.org/T177782#3684635 (10Dzahn)