[04:49:28] (03CR) 10Nuria: [C: 03+1] Rename hive_fields to be more descriptive [analytics/refinery] - 10https://gerrit.wikimedia.org/r/624779 (owner: 10Milimetric) [06:05:03] good morning [06:23:40] I am doing a restart of namenodes and resource managers (for openjdk updates) [06:32:06] or I can write a cookbook [07:00:52] Good morning [07:01:14] elukey: as you wish :) [07:01:28] almost done :) [07:02:10] ok - too late I am :) [07:02:46] I basically dumped what I do every time, I think it is good also for documentation [07:02:55] this is one of the things that I haven't automated yet [07:02:58] It definitely is [07:04:53] and I can test it on hadoop test first, to avoid destroying prod [07:07:03] OTOH destroying prod certainly simplifies the bigtop migration :-) [07:07:18] it does yes! [07:26:47] there you go https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/625782/ [08:07:10] mgerlach: Hello :) You have ajob taking 2/3 of the cluster right now - I assume it's a misconfiguratio [08:07:25] joal: ups [08:07:31] thanks for pointing out [08:07:59] mgerlach: Very high RAM-to-CPU ratio (a lot of ram per CPU) [08:08:16] Thanks a lot mgerlach :) [08:23:52] * elukey coffee [09:06:16] the new Hue supports also Presto afaics [09:42:34] the Hue upstream devs answered, didn't merge but it is a positive progress :D [09:43:56] \o/ [10:36:51] * elukey lunch! [11:26:21] 10Analytics, 10Analytics-Kanban: Test hudi as an incremental update system using 2 mediawiki-history snapshots - https://phabricator.wikimedia.org/T262256 (10JAllemandou) [11:26:42] 10Analytics, 10Analytics-Kanban: Test hudi as an incremental update system using 2 mediawiki-history snapshots - https://phabricator.wikimedia.org/T262256 (10JAllemandou) [11:54:56] 10Analytics, 10Analytics-Kanban: Test hudi as an incremental update system using 2 mediawiki-history snapshots - https://phabricator.wikimedia.org/T262256 (10JAllemandou) Learnt stuff: - Hudi needs a primary-key. We use a hash of values as a compound key. See code below about the implementation. - one month... [11:57:22] 10Analytics: Make hudi work with Hive - https://phabricator.wikimedia.org/T262260 (10JAllemandou) [11:57:53] 10Analytics, 10Analytics-Kanban: Test hudi as an incremental update system using 2 mediawiki-history snapshots - https://phabricator.wikimedia.org/T262256 (10JAllemandou) a:03JAllemandou [11:59:50] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) [12:00:05] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) a:03JAllemandou [12:01:31] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) Bug found in mediawiki_revision_create events: some revision-create have multiple events (same revision-id, multiple event-requests) - Tracked in... [12:02:17] 10Analytics, 10Analytics-Kanban: Check that mediawiki-events match mediawiki-history changes over a month - https://phabricator.wikimedia.org/T262261 (10JAllemandou) Missing events for user_create and user_rename. Tracked in T262205. [13:06:35] elukey: o/ [13:06:45] 'The new Hadoop workers with GPUs have 24x2TB disks, and no flex bay.', is that just because these are the GPU nodes? Or is that a mistake? 
[13:06:45] ottomata: gooood morning [13:06:58] Do the other many nodes we ordered have flex bays? [13:07:31] (03CR) 10Mforns: Removing seasonality cycle as it is fixed once granularity is set (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623456 (https://phabricator.wikimedia.org/T257691) (owner: 10Nuria) [13:07:39] ottomata: nono it was due to the gpus, the chassis is different and there was no option for flex bay.. We discussed it with Rob in the task IIRC [13:07:46] together with the 24 disks issue [13:08:27] all the regular hadoop workers have the flex bay, but I think that Chris didn't configure it (I followed up in the racking task to make the usual hw raid1) [13:09:36] ottomata: https://phabricator.wikimedia.org/T242147#5965183 [13:09:39] ok phew [13:09:45] "The issue comes into place where the old hadoop spec was 12 * 4TB LFF SATA HDD + 2 * SFF flexbay 240GB SSD (os disks). The new chassis has the GPU fan upgrades, which only allow for 24 SFF disks and no flexbay." [13:10:02] ok great, nice [13:10:07] thanks didn't remember all that [13:10:10] great stuf [13:10:28] I tried to come up with a new partition layout that was not super cumbersome to maintain, lemem know if it is ok for you (basically two 2TB disks used for root etc.. in raid 1) [13:17:35] yaya sounds good, sounds easier than trying to partition part of those disks [13:17:58] yes I thought the same.. life is too short to mess with partitions :D [13:22:53] o/ [13:40:34] there was an outage in one rack, we have two aqs nodes down + eventlog1002 afaics [13:42:33] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-si [13:42:33] }/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received [13:42:33] imedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:42:49] yep [13:43:27] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-si [13:43:27] }/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:43:57] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: 
/analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/me [13:43:57] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:44:19] PROBLEM - Check the last execution of analytics-dumps-fetch-mediacounts on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:45:37] hi a-team, is the stat1006 machine working? My connection died and now I can't ssh [13:46:49] dsaez: same here [13:47:25] from messages in #wikimedia-operations and #wikimedia-sre, it looks like there is a network switch problem [13:47:45] dsaez: there seems to be a big network outage, we are still trying to check [13:48:28] ooh, I see...I can log in to stat1008, but not to stat1006 :S [13:48:46] they are probable different racks [13:49:01] big outage happening now, not just stat boxes it looks like [13:49:04] SRE is on it [13:49:58] I see [13:50:11] rhx for the info [13:50:39] PROBLEM - Check the last execution of refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:51:39] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -codfw- on icinga1001 is CRITICAL: 391.4 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [13:52:02] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -codfw- on icinga1001 is CRITICAL: 968.7 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [13:52:03] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:52:19] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is CRITICAL: 57.9 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [13:52:29] PROBLEM - cache_text: Varnishkafka eventlogging Delivery Errors per second -eqsin- on icinga1001 is CRITICAL: 18.65 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [13:52:37] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is CRITICAL: 239.4 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka 
https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [13:52:57] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -esams- on icinga1001 is CRITICAL: 1220 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=esams+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [13:53:09] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -eqsin- on icinga1001 is CRITICAL: 1595 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [13:53:39] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:53:53] PROBLEM - cache_text: Varnishkafka eventlogging Delivery Errors per second -codfw- on icinga1001 is CRITICAL: 14.12 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [13:54:39] PROBLEM - Check the last execution of analytics-dumps-fetch-pageview on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [13:55:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is CRITICAL: 3165 gt 1000 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [13:56:57] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:58:17] PROBLEM - Throughput of EventLogging NavigationTiming events on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [13:58:37] 10Analytics-Radar, 10Operations, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Vgutierrez) [14:00:49] RECOVERY - cache_upload: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [14:01:09] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:02:15] RECOVERY - cache_upload: Varnishkafka webrequest Delivery 
Errors per second -codfw- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [14:02:25] RECOVERY - cache_text: Varnishkafka eventlogging Delivery Errors per second -codfw- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [14:02:32] this is good [14:02:39] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -codfw- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:03:09] RECOVERY - cache_text: Varnishkafka eventlogging Delivery Errors per second -eqsin- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [14:04:04] Gone for kids team - will be back for standup [14:05:41] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{proje [14:05:41] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) timed out before a response was received: /analytics.wi [14:05:41] dits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipe https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:06:02] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:06:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:07:03] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) 
timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per [14:07:03] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:07:59] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:08:09] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is CRITICAL: 2.55e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:08:17] situation is still not stable sadly [14:08:35] PROBLEM - Check the last execution of camus-eventgate-main_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit camus-eventgate-main_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:10:43] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:27] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 9754 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:15:35] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -codfw- on icinga1001 is CRITICAL: 1050 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:15:42] PROBLEM - Check the last execution of analytics-dumps-fetch-clickstream on labstore1007 is CRITICAL: CRITICAL: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:15:45] PROBLEM - cache_upload: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is CRITICAL: 64.53 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [14:16:05] PROBLEM - cache_text: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is CRITICAL: 214.2 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:17:41] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:49] we should get some recovery soon [14:17:53] RECOVERY - cache_upload: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is OK: (C)5 ge (W)1 ge 0 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [14:18:07] PROBLEM - cache_text: Varnishkafka eventlogging Delivery Errors per second -eqsin- on icinga1001 is CRITICAL: 32.23 ge 5 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [14:18:15] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -ulsfo- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=ulsfo+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:18:25] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:18:42] the rack D3 is still impacted, its switch is the one that was causing this mess, they are debugging it. The following nodes will stay down: stat1006, thorium, eventlog1002. Wikistats is currently not available. [14:18:42] D3: test - ignore - https://phabricator.wikimedia.org/D3 [14:19:39] RECOVERY - cache_text: Varnishkafka webrequest Delivery Errors per second -codfw- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&prometheus/ops&var-source=webrequest&var-cp_cluster=cache_text&var-instance=All [14:20:07] RECOVERY - cache_text: Varnishkafka eventlogging Delivery Errors per second -eqsin- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=eqsin+prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=All [14:22:39] Hello, I am trying to kill some zombie jobs in 1005 with kill -9, but it doesnt kill them. Any idea? [14:23:40] agaduran: hi! Are those zombies related to your process already exited? [14:24:01] Yes [14:24:03] zombies are basically leftovers and they are not really running, so in theory you can't kill them [14:24:42] RECOVERY - cache_upload: Varnishkafka webrequest Delivery Errors per second -esams- on icinga1001 is OK: (C)5 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka https://grafana.wikimedia.org/d/000000253/varnishkafka?panelId=20&fullscreen&orgId=1&var-datasource=esams+prometheus/ops&var-source=webrequest&var-cp_cluster=cache_upload&var-instance=All [14:25:01] moreover, zombies consume almost no resources [14:25:15] A process table entry, that's about it. [14:25:24] yep --^ [14:25:36] You should try and find their PPID and see why that process is not reaping children [14:26:07] I see only some processes in Z state but not for your user agaduran [14:26:11] (on stat1005) [14:27:55] https://phabricator.wikimedia.org/P12524 [14:28:28] python generate_anchor_dictionary_simple.py de seems to not reap generate_anchor children (or is not doing so quickly enough for them to not be visible as Zombies) [14:28:43] elukey: I see, thanks! 
That is weird. 1005 has become unrunnable since this morning (also mgerlach is experiencing similar problems), and I thought this might be the reason. Anyways, there is something taking up all the memory without doing anything [14:29:21] (the paste I linked is partial output of ps axf which I find very helpful in these situations) [14:29:23] agaduran: ah nono there is a network outage that is being resolved atm [14:29:51] klausman: really nice thanks, I didn't find them with ps aux | grep Z sigh :( [14:30:15] I still see them with that [14:30:36] All were started at 14:19 UTC [14:30:43] elukey: Aha I see. Then this might be the problem! [14:31:24] klausman: if you want to have "fun", in icinga.wikimedia.org there is a looong list of open issues due to the network outage [14:31:41] in theory, now only hosts in rack d3 should be impacted (there are a few of ours) [14:31:56] Yeah, I saw the spam in the IRC channels and went looking. A bloodbath :) [14:31:56] full list in https://netbox.wikimedia.org/dcim/racks/37/ [14:32:54] in a lot of the ones alarming but not in d3 there is a systemd service in failed state, ferm [14:33:00] Moritz is working on restoring the service [14:33:12] what does ferm do? [14:33:27] basically a nice frontend for iptables [14:33:51] but it got upset for example like [14:33:52] Sep 08 14:02:59 an-druid1001 ferm[45995]: DNS query for 'prometheus1004.eqiad.wmnet' failed: query timed out [14:34:03] during the troubles [14:34:14] Also, re: 1005 being slow. It had a load avg >300, which, while not a good proxy for many things, is still an indication that things are slow [14:34:24] load avg is rapidly dropping now [14:34:25] hahahah yes [14:34:45] we do have some ceiling for CPU/memory usage via cgroups on stat100x [14:34:48] (and the Zombies are gone) [14:34:55] RECOVERY - Check the last execution of refine_eventlogging_analytics on an-launcher1002 is OK: OK: Status of the systemd unit refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:36:04] My best guess is the machine was busy-waiting on things that didn't work due to the network outage. As a result, PID 2211 (generate_anchor_dictionary_simple.py) was scheduled very little and so could not do much housekeeping, i.e. collecting exit statuses from the child processes it had created. [14:36:55] Whatever R job is running there (nice'd) is also gobbling lots of CPU and likely causing a lot of system-wait [14:37:10] And now the loadavg is >4550. Ouch. [14:37:13] ah so the zombies were only there due to the parent being slow to check their final status [14:37:14] *450* [14:37:25] That's what I suspect, yes [14:38:35] htop is not looking pretty :) [14:38:41] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on icinga1001 is OK: (C)1000 gt (W)100 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_jumbo-eqiad [14:39:30] GoranSM: hi! are you around? [14:40:01] your R processes on stat1005 are consuming a ton of resources [14:40:23] Could this be a cron overrun? (A cronjob being so slow that it overlaps with a subsequent/previous one, thus being even slower and getting worse over time) [14:41:26] There's about 30 R processes, each using between 50% and 300%CPU.
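
(Editor's note: klausman's diagnosis above — list the zombies, find their PPID, and check why the parent isn't reaping them — can be reproduced with a small script. The sketch below is purely illustrative and not part of any WMF tooling; it reads /proc directly, which is what ps itself does, and only works on Linux.)

```python
#!/usr/bin/env python3
"""List zombie (Z-state) processes and the parent that has not reaped them."""
import os

def stat_fields(pid):
    # /proc/<pid>/stat looks like: "<pid> (<comm>) <state> <ppid> ..."
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm may contain spaces or parens, so split around the last ')'
    rest = data.rsplit(")", 1)[1].split()
    comm = data[data.index("(") + 1:data.rindex(")")]
    state, ppid = rest[0], int(rest[1])
    return comm, state, ppid

def comm_of(pid):
    try:
        return stat_fields(pid)[0]
    except OSError:
        return "?"

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        comm, state, ppid = stat_fields(entry)
    except OSError:
        continue  # process exited while we were scanning
    if state == "Z":
        print(f"zombie pid={entry} comm={comm} ppid={ppid} parent={comm_of(ppid)}")
```

(In the incident above the parent, PID 2211, was simply starved of CPU time, so the zombies disappeared on their own once load dropped.)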
[14:41:33] (probably more in spikes) [14:42:10] last loging from GoranSM was 2 days ago on 1005, it is probably a cron klausman, you are right [14:42:10] Oh, and they all have the same commandline [14:44:29] ------------ stat1006 eventlog1002 thorium are currently down due to network issues ----------- [14:44:48] and stats.wikimedia.org is down as well [14:45:37] I've looks at Goran's crontab, but all jobs in there run at most once a day [14:46:31] I think it's this job: [14:46:33] # WDCM Engine (T)itelinks [14:46:35] 0 12 2,8,15,22,28 * * export USER=goransm && nice -10 Rscript /home/goransm/Analytics/WDCM/WDCM_Scripts/WDCM_Engine_Titles.R >> /home/goransm/Analytics/WDCM/WDCM_Output/WDCM_Logs/WDCM_Engine_Titles_RuntimeLog.txt 2>&1 [14:46:42] could be yes [14:47:01] The running processes have a log file open, named WDCM_Engine_Titles_RuntimeLog.txt, which is the best fit amongst his cronjobs. [14:47:30] And the jobspec says it would run at noon on the eighth of the month, so if it takes a few hours to run, this would be it [14:49:04] ok so a big job scheduled [14:49:32] So there is a node exporter running there. Where is the corresponding dashboard? [14:49:50] RECOVERY - Check the last execution of analytics-dumps-fetch-mediacounts on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-mediacounts https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [14:50:02] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=stat1005&var-datasource=thanos&var-cluster=analytics [14:50:05] klausman: --^ [14:50:20] Yeah, lovely rectangle graphs [14:51:22] Almost half of the CPU usage is system, not ideal. Is this hammering (spinning rust) disks? [14:51:57] It doesn't *look* like it from the dashboard [14:53:12] Looking at the weekly history, this seems to not be *too* unusual [14:53:44] So with being the ignorant newbie, I'm gonna go with "it's probably alright" [14:54:26] this is something that I have been trying to improve over time, so I'd really like to get some input from a more expert SRE :) [14:54:49] Which specific part do you want to improve? 
[14:54:51] these hosts (stat100x) can be used for multiple things, from crunching data to simply get dat from hadoop [14:55:14] historically the oom killer was running wildly, so I added some cgroups boundaries [14:55:26] for all the processes running under the user slice [14:56:01] you can check /etc/systemd/system/user.slice.d/puppet-override.conf [14:56:29] so the idea is to kill the topmost memory consumers when the RAM usage gets around 90% [14:57:06] and to limit the CPU quota to 90% of the cpus IIRC [14:57:28] it worked reasonably well so far, especially for the RAM usage [14:57:41] I suspect that we could do a better job for the CPUQuota [14:58:09] but the idea was to apply limits to all user processes (hence the user.slice) rather than per-process limits [14:58:14] CPU is a bit tricky in that a user space process can cause kernel-side CPU usage that is not accounted to it [14:58:18] sorry, per user slice limits [14:58:43] ottomata: I was about to rerun that thing, was just looking to see if the network outages caused it in any way [14:58:47] (meaning single user slice limits, ETOOMUCHOVERLAP) [14:58:54] (the failed refine, you don't need to do that, it's part of ops week) [14:59:06] And in this case, I am unsure what the system CPU time is actually spent on [14:59:12] milimetric: ya dunno if network outages caused it, but likely [14:59:25] klausman: yeah I was very unsure as well [14:59:25] s'ok i beatya to it! [14:59:26] :p [15:00:02] klausman: sometimes we get into these situations of high usage, not sure if we can do much about it without getting crazy in weird settings :D [15:00:21] anyway, if you want to check etc.. everything is in puppet [15:00:40] So it's not disk io on the R processes [15:02:25] hm, (Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: yarn/an-master1001.eqiad.wmnet@WIKIMEDIA, expecting: yarn/10.64.5.26@WIKIMEDIA; Host Details : local host is: "an-coord1001/10.64.21.104"; destination host is: "an-master1001.eqiad.wmnet":8032; ) [15:02:36] milimetric: o/ worth to follow up, thorium is down and so stats.wikimedia.org. Maybe in the long run we want to have a VM that can serve the website or similar [15:02:56] (an additional VM I mean in ganeti) [15:03:05] yeah, would make sense, it's mostly the caching servers doing the work, it could be served very effectively by a vm IMO [15:03:25] the only ? is related to the v1 content [15:03:40] but we can open a task and discuss it [15:03:42] meh, it's ok if that's down when thorium is down [15:03:42] What's Thorium? [15:03:56] it's kind of like the public web host [15:04:00] thorium.eqiad.wmnet, it is one of our oldest nodes [15:04:11] Ah, ack. 
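
(Editor's note: the user.slice limits elukey describes live in the puppet-managed /etc/systemd/system/user.slice.d/puppet-override.conf on the stat100x hosts. Below is a hedged sketch of how one might inspect the effective caps via systemctl; the exact property names (MemoryMax vs MemoryLimit, etc.) depend on the systemd version, so the script just prints whichever of a few common ones are actually set.)

```python
#!/usr/bin/env python3
"""Print the memory/CPU caps currently applied to user.slice, if any."""
import subprocess

props = subprocess.run(
    ["systemctl", "show", "user.slice",
     "--property=MemoryMax,MemoryHigh,MemoryLimit,CPUQuotaPerSecUSec,TasksMax"],
    capture_output=True, text=True, check=True,
).stdout

for line in props.splitlines():
    key, _, value = line.partition("=")
    # skip unset / unlimited values (older systemd reports the 64-bit max)
    if value and value not in ("infinity", "18446744073709551615"):
        print(f"{key} = {value}")
```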
[15:04:47] milimetric: re kerberos, I suspect it was due to a network error, re-running should fix it [15:04:58] ok, just listing here because it sounded weird [15:05:02] will rerun [15:05:18] RECOVERY - Throughput of EventLogging NavigationTiming events on icinga1001 is OK: (C)0 le (W)1 le 7.295 https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration https://grafana.wikimedia.org/dashboard/db/eventlogging?panelId=6&fullscreen&orgId=1 [15:05:24] niceee [15:05:30] the hosts are back up [15:09:04] RECOVERY - Check the last execution of analytics-dumps-fetch-clickstream on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-clickstream https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:09:30] RECOVERY - Check the last execution of analytics-dumps-fetch-pageview on labstore1007 is OK: OK: Status of the systemd unit analytics-dumps-fetch-pageview https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:10:32] hm it would be nice if wikistats v2 was HA on some VMs [15:10:39] i wonder if we could just proxy v1 to thorium [15:10:55] then only v1 would go down [15:12:08] this is a very good idea [15:12:22] RECOVERY - Check the last execution of camus-eventgate-main_events on an-launcher1002 is OK: OK: Status of the systemd unit camus-eventgate-main_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:17:35] ottomata: ok if I roll restart the hadoop masters? [15:18:02] I don't see any post-net-outage weirdness [15:18:05] also milimetric --^ [15:18:28] elukey: sure [15:18:46] sure [15:19:14] ottomata: i am running it via a cookbook :) [15:19:30] already tested in hadoop test, runs fine [15:20:35] the nice follow up would be to create a cookbook to reboot the two hadoop masters safely [15:20:56] (currently checking the data loss warnings) [15:21:34] razzi: if you're around, that's something you may want to learn about, lemme know I can join bc [15:36:52] (03CR) 10Mforns: "Left one comment, but except from that LGTM! Thanks for this change :]" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623141 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [15:37:48] milimetric: we dropped real data this time, a lot of varnishkafka delivery errors sadly :( [15:38:16] yeah, I see that. I will tally up the actual loss so we have a record of it. [15:42:07] So that means we missed some traffic in the sense that the generated statistics etc will be incomplete, but the users that made those requests wouldn't have noticed, right? [15:48:07] klausman: on the caching nodes (say cp3060) we have the famous varnishkafka daemons that are responsible to ship HTTP request metadata to kafka. 
In this case, for some of the impacted cp nodes we observed the "delivery report error", that is a callback triggered when librdkafka (the lib that we use) gives up trying to send some messages to kafka (it uses some retry logic) [15:48:51] the interesting thing is that these delivery reports are coming from multiple DCs, and the outage was in eqiad only (row D) [15:49:28] the main issue seems that one of the kafa brokers impacted by the eqiad network outage (kafka jumbo runs only in eqiad, all the cache nodes connect to it via TLS) caused a ton of timeouts [15:50:16] for example: https://grafana.wikimedia.org/d/000000253/varnishkafka?viewPanel=20&orgId=1 [15:50:28] what also was strange was how long kafka-jumbo1006 stayed in the ISR [15:50:49] the leadership for its partitions should have automatically been moved to another broker [15:51:12] unclean shutdowns leading to timeouts are very weird for our version of kafka, we had other strange behaviors in the past IIRC [15:51:43] i saw 1006 being dropped from the ISR for partitions it wasn't the leader for [15:52:18] but it caused troubles when it was the leader? [15:52:32] (not giving up the throne to other replicas) [15:54:11] https://github.com/cloudera/hue/commit/c04d89a2770aeb884f44e6b0dff018ec8c22349d [15:54:14] \o/ [15:57:48] oh damn, my timing for pizza was off. I may have to tease people during the standup with me eating :D [16:01:55] klausman: standup! [16:22:11] razzi: do you want to rejoin? [16:22:16] or speak later? [16:23:06] Give me a couple minutes to snack :) [16:23:10] 9:30? [16:23:20] errr in 7 minutes? [16:24:00] sure! [16:24:40] (03PS3) 10Gerrit maintenance bot: Order entries by alphabetical order [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623141 (https://phabricator.wikimedia.org/T253439) [16:25:05] (03CR) 10Ladsgroup: "> Patch Set 2:" (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623141 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [16:27:58] (03CR) 10Mforns: "Thanks for this change, Paul, it's great! :]" [analytics/reportupdater] - 10https://gerrit.wikimedia.org/r/623060 (https://phabricator.wikimedia.org/T193171) (owner: 10Paul Kernfeld) [16:33:07] mforns: I don't know if I had clicked save on this, I just saw I was mid-edit since last week, but it's here now: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Geoeditors [16:55:32] 10Analytics, 10Analytics-Kanban, 10Privacy Engineering, 10Product-Analytics, and 3 others: Drop data from Prefupdate schema that is older than 90 days - https://phabricator.wikimedia.org/T250049 (10nettrom_WMF) @Milimetric : I've gone through the various subtasks and changes we made before the tracking lis... [17:25:10] * elukey afk! [17:28:50] ottomata: hey :] there's a community member that wrote an improvement to reportupdater, and they used a python3 lib called pid. It is not already available in an-launcher1002 for python3, does it mean we need to debianize it if we want to have reportupdater use it? [17:29:17] thanks milimetric! [17:30:16] hm [17:30:35] mforns: that is probably the simplest route, maybe in the future we can start using anaconda-wmf for our stuff too [17:30:46] but often debianianizing a simple python package isn't too hard [17:31:31] I see [17:32:28] ottomata: do you have an example of another python lib we debianized? 
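
(Editor's note: varnishkafka itself is C code on top of librdkafka, but the delivery-report mechanism elukey describes above is easy to illustrate with the Python confluent-kafka bindings, which wrap the same library. This is only an analogy sketch: the broker address and topic are placeholders, not the production kafka-jumbo cluster, and the timeout value is arbitrary.)

```python
#!/usr/bin/env python3
"""Sketch of librdkafka delivery reports via confluent-kafka."""
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "message.timeout.ms": 30000,            # give up (and report an error) after 30s of retries
})

def delivery_report(err, msg):
    # librdkafka calls this once per message, after the broker acked it
    # or after all retries were exhausted.
    if err is not None:
        # This is the kind of "delivery report error" the varnishkafka
        # Grafana dashboard linked above is counting.
        print(f"delivery failed for {msg.topic()}[{msg.partition()}]: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

for i in range(10):
    producer.produce("example_topic", value=f"request {i}".encode(),
                     callback=delivery_report)
    producer.poll(0)   # serve any queued delivery callbacks

producer.flush(10)     # wait for outstanding messages and their reports
```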
I could try :] [17:33:09] mforns: it is kinda hard without access to the wmf build server, setting up the env is a little cumbersome [17:33:20] ah [17:33:23] but [17:33:24] https://wikitech.wikimedia.org/wiki/Git-buildpackage#How_to_build_a_Python_deb_package_using_git-buildpackage [17:33:26] those are probably old [17:33:34] but woudl kidna work [17:33:46] ok, will look! thx [17:34:14] mforns: some other examples [17:34:15] https://gerrit.wikimedia.org/r/admin/repos/q/filter:operations%252Fdebs%252Fpython [17:38:09] ottomata: thanks! [17:49:33] Not sure if you guys are aware, but there's an e-mail titled `Let's Encrypt certificate expiration notice for domain "archiva-new.wikimedia.org"` which says that `archiva-new.wikimedia.org` will expire on `10 Sep 20 10:00 +0000` (cc ottomata) [17:49:51] elukey: ^ archiva-new is gone rigth? [17:52:45] ottomata: the latest SearchSatisfaction fail is something else, it fails with DROPMALFORMED too [17:53:25] I can take a look, but I thought you'd fixed that last time [17:53:34] looking [17:54:20] milimetric: that looks like a schema lookup malfunctions... lookinbg [17:55:20] hmm which job is this, it sholud be eventlogging_legacy [17:55:28] i should put the job name into the email alert... [17:55:29] :p [17:56:22] +1 to that ottomata :) [18:01:33] ottomata: it is gone yes [18:03:03] yeah, I reran the wrong eventlogging_analytics a couple times until I remembered :) It'd be nice in the email, but also it'd be nice if the error message gave a little more of the stack, right now it just says "NullPointerException" [18:03:51] milimetric: somehow we got data from old eventlogging-processor for SearchSatisfaction [18:03:58] and that data happened to be the first line in that hour [18:04:09] ah, that thing yall said wouldn't happen :) [18:04:31] btw, also weird but not broken, this druid indexing has been going on for a few hours: https://hue.wikimedia.org/oozie/list_oozie_workflow/0091707-200720135922440-oozie-oozi-W/?coordinator_job_id=0028016-191216160148723-oozie-oozi-C [18:04:44] milimetric: Just saw ou email about dataloss and there must be something wrong in your computation - 97% loss would mean an error, not a warning [18:05:11] joal: I was just looking at that more, yeah, what I see when I run the checker is this line: [18:05:22] cp2031.codfw.wmnet false 32897525 3592261577 3559364052 31370 3592292946 9189739 11349131 12358655 [18:05:55] so that means 3559364052 is missing out of 3592261577, and that's the vast majority of the total, so doesn't look like a computation error [18:06:11] maybe that host lost all its traffic, but would still not represent a global loss of 97% :) [18:07:04] right, I'm just computing the total lost out of the total expected, for the nodes that report non-false positives, so I guess it's misleading, for sure [18:10:18] also milimetric, since the minimal value of sequence-id is very low (31370) we can assume that the varnish-kafka's host got restarted, and therefore numbers shouldn't be taken into account [18:11:14] normally we have a filter for those (if MIN(seq_id) = 1, but as the error is network, maybe the small-rows of the first hits got missed [18:11:25] oh, hm, right, I'll have to take a closer look. 
I guess I'll wait until all the loss is reported and things return to normal and do an overall loss computation [18:12:23] here we are talking 32M rows (count-distinct) vs 359M expected - 1 order of magnitude - This feels incorrect :) [18:13:08] I have no idea, I thought the outage was bad and we lost a lot of data [18:14:23] milimetric: a relatively easy way to check is to look for hosts without errors, and see how many rows they have processed - If it's in the tens of M, then we're ok, if it's in the hundreds of millions, we're not [18:15:07] they're usually fairly equally balanced then? [18:15:34] milimetric: as for order of magnitude, yes - as long as comparing in the same DC [18:16:53] k, will do that [18:32:23] 10Analytics, 10Analytics-EventLogging, 10Event-Platform: eventlogging-processor should fail to produce schemas that have been migrated to Event Platform - https://phabricator.wikimedia.org/T262304 (10Ottomata) [18:36:02] (03CR) 10Mforns: [V: 03+2 C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/623141 (https://phabricator.wikimedia.org/T253439) (owner: 10Gerrit maintenance bot) [19:20:43] Gone for tonight team - see you tomorrow [21:28:00] 10Analytics, 10Analytics-EventLogging, 10Event-Platform: eventlogging-processor should fail to produce schemas that have been migrated to Event Platform - https://phabricator.wikimedia.org/T262304 (10Nuria) * i think* this happens cause the clients are running an old version of the code prior to the changes,... [21:56:43] 10Analytics, 10Analytics-Kanban, 10Product-Analytics: Technical contributors emerging communities metric definition, thick data - https://phabricator.wikimedia.org/T250284 (10jwang) @Jhernandez thank you very much for your review and edits. And thank you for bringing up the formatting issue on smaller scre... [22:37:01] 10Analytics, 10Research: Citation Usage: Can code be removed? - https://phabricator.wikimedia.org/T262349 (10Jdlrobson) [22:37:14] 10Analytics, 10Research: Citation Usage: Can instrumentation code be removed? - https://phabricator.wikimedia.org/T262349 (10Jdlrobson) [23:42:42] 10Analytics-Radar, 10Research: Citation Usage: Can instrumentation code be removed? - https://phabricator.wikimedia.org/T262349 (10Nuria)
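
(Editor's note: a back-of-the-envelope version of the webrequest loss arithmetic milimetric and joal discuss above (around 18:05–18:16), using only the numbers quoted in channel for cp2031.codfw.wmnet. The production check works over per-host sequence numbers in Hive; this only reproduces the percentage and joal's caveat about varnishkafka restarts.)

```python
#!/usr/bin/env python3
"""Loss-percentage arithmetic for one host, from the numbers quoted in channel."""

expected = 3_592_261_577   # rows implied by the sequence-number range (milimetric's reading)
missing  = 3_559_364_052   # rows not found
min_seq  = 31_370          # lowest sequence id seen for the hour

loss_pct = 100 * missing / expected
print(f"apparent loss for this host: {loss_pct:.1f}%")  # ~99%, clearly not a credible global figure

# joal's caveat: a very low minimum sequence id (relative to billions) suggests the
# varnishkafka counter was reset by a restart mid-hour, which inflates "expected".
# The standard filter only catches MIN(seq_id) = 1; here the first rows after the
# restart were themselves lost to the network error, so the filter missed it.
if min_seq != 1 and min_seq < expected // 1000:
    print("suspiciously low minimum sequence id -> likely a varnishkafka restart;"
          " treat this host's loss estimate as unreliable")
```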