We're using Java Hotspot 1.6 with a 10.3.6 WebLogic Cluster with 2 nodes running OSB 11.1.1.7 behind a physical load balancer:
java version "1.6.0_91" Java(TM) SE Runtime Environment (build 1.6.0_91-b13) Java HotSpot(TM) 64-Bit Server VM (build 20.91-b07, mixed mode)
We've tuned the JVM to use CMS and avoid long running garbage collections:
-XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark
Most of the pause phases of the GCs (initial mark and remark) take less than a second, we don't see any failed CMS runs and we don't see Full GCs in the gc logs.
However, we're receiving timeout errors on clients trying to connect to OSB a few seconds after a completed CMS GC execution (particularly in an Oracle SOA installation with lots of composites that call OSB):
OSB 1 gc.log:
> 2016-07-28T08:31:52.656-0300: 492247.689: [GC > 2016-07-28T08:31:52.656-0300: 492247.690: [ParNew: > 1027265K->22014K(1132416K), 0.0412613 secs] > 4046262K->3043322K(6165632K), 0.0420791 secs] [Times: user=0.25 > sys=0.00, real=0.04 secs] 2016-07-28T08:31:52.701-0300: 492247.735: > [GC [1 CMS-initial-mark: 3021307K(5033216K)] 3043844K(6165632K), > 0.0683781 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] 2016-07-28T08:31:52.770-0300: 492247.804: [CMS-concurrent-mark-start] > 2016-07-28T08:31:56.971-0300: 492252.005: [CMS-concurrent-mark: > 4.201/4.201 secs] [Times: user=10.55 sys=0.16, real=4.20 secs] 2016-07-28T08:31:56.971-0300: 492252.005: > [CMS-concurrent-preclean-start] 2016-07-28T08:31:57.064-0300: > 492252.098: [CMS-concurrent-preclean: 0.090/0.093 secs] [Times: user=0.12 sys=0.00, real=0.09 secs] 2016-07-28T08:31:57.064-0300: > 492252.098: [CMS-concurrent-abortable-preclean-start] 2016-07-28T08:32:01.198-0300: 492256.231: [GC > 2016-07-28T08:32:01.198-0300: 492256.232: [ParNew: > 1028128K->30711K(1132416K), 0.0502326 secs] > 4049435K->3053383K(6165632K), 0.0508313 secs] [Times: user=0.34 > sys=0.00, real=0.05 secs] CMS: abort preclean due to time > 2016-07-28T08:32:02.128-0300: 492257.161: > [CMS-concurrent-abortable-preclean: 4.027/5.063 secs] [Times: > user=7.62 sys=0.24, real=5.06 secs] 2016-07-28T08:32:02.134-0300: > 492257.167: [GC[YG occupancy: 186361 K (1132416 K)]2016-07-28T08:32:02.134-0300: 492257.167: [GC > 2016-07-28T08:32:02.134-0300: 492257.167: [ParNew: > 186361K->25052K(1132416K), 0.0231613 secs] > 3209034K->3049538K(6165632K), 0.0236069 secs] [Times: user=0.17 > sys=0.00, real=0.02 secs] 2016-07-28T08:32:02.158-0300: 492257.191: > [Rescan (parallel) , 0.0289939 secs]2016-07-28T08:32:02.187-0300: > 492257.220: [weak refs processing, 0.3412320 secs] [1 CMS-remark: 3024485K(5033216K)] 3049538K(6165632K), 0.4179878 secs] [Times: > user=0.67 sys=0.00, real=0.42 secs] 2016-07-28T08:32:02.552-0300: > 492257.586: [CMS-concurrent-sweep-start] 2016-07-28T08:32:07.144-0300: 492262.177: [CMS-concurrent-sweep: 4.591/4.591 secs] [Times: user=14.05 sys=0.22, real=4.59 secs] 2016-07-28T08:32:07.146-0300: > 492262.180: [CMS-concurrent-reset-start] 2016-07-28T08:32:07.283-0300: 492262.316: [CMS-concurrent-reset: 0.137/0.137 secs] [Times: user=0.36 sys=0.01, real=0.14 secs] 2016-07-28T08:32:36.201-0300: 492291.234: > [GC 2016-07-28T08:32:36.202-0300: 492291.234: [ParNew: > 1030955K->45449K(1132416K), 0.0671313 secs] > 2269248K->1285059K(6165632K), 0.0676485 secs] [Times: user=0.41 > sys=0.00, real=0.07 secs]
Is in the OSB WebLogic access log that we noticed this strange behavior, usually 5 or 10 seconds after CMS starts, there's a 20+ second gap between requests in the access log, when usually we see dozens of calls per second.
OSB1 access log:
2016-07-28 08:32:03 POST /service1/ 200 0.043 2016-07-28 08:32:03 POST /service1/ 200 0.074 2016-07-28 08:32:03 POST /service1/ 200 0.045 2016-07-28 08:32:03 POST /service1/ 200 0.053 2016-07-28 08:32:35 POST /service1/ 200 0.076
And at the same time some of the calls made from the SOA server fail:
SOA log error: <Jul 28, 2016 8:32:19 AM ART> <oracle.integration.platform.blocks.soap> <Unable to invoke endpoint URI "https://esburl.mycompany.com/service1" successfully due to: Unable to invoke endpoint URI "https://esburl.mycompany.com/service1" successfully due to: javax.xml.soap.SOAPException: javax.xml.soap.SOAPException: Message send failed: Read timed out>
There's no errors reported in the OSB weblogic output log. We've taken some threadumps while GC was running but haven't been able to pinpoint the issue. We thought that if initial mark and remark were fast enough we wouldn't see this kind of problems, but we still have doubts regarding if this is a JVM, a WebLogic or an application issue (OSB). Anyone has seen something like this with WebLogic or another applications server?
Thanks!