
Scaling up EC2 instances in an AWS cluster with the JGroups protocol leads to an increased number of DB connections


Hi guys,

Setup of the environment: we are running an e-commerce platform on Hybris (Tomcat) that uses JGroups version 3.4.1. Please note that we are using TCP as opposed to UDP, because AWS does not support multicast in our VPC.

On the last few occasions when we ran a discount sale, we had to scale up (via auto scaling) the number of EC2 instances serving customer traffic - say 7 new nodes on top of the existing 3. This is where we face some weird issues: the number of DB connections gets significantly high (with around 12 customer-facing EC2 instances in the cluster, it climbs to 30,000-35,000 DB connections per EC2 instance), which leads to DB resource contention, and eventually the application stops serving customer traffic. Note that the increase in DB connections happens only while we scale up. Once all EC2 instances are up and running - say around 18 nodes in the cluster - the number of DB connections stays within normal limits, more or less around 100.
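To put a number on this per node, one quick check is counting established TCP sockets to the DB endpoint (a sketch; `count_established` is our own helper name, and 1433 below is a placeholder for the actual DB port):

```shell
# count_established PORT: counts ESTABLISHED connections to the given
# destination port from `netstat -ant` output supplied on stdin.
count_established() {
  grep ":$1 " | grep -c ESTABLISHED
}

# Usage on a live node (1433 is a placeholder for the real DB port):
#   netstat -ant | count_established 1433
```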

We have configured JDBC_PING in JGroups, and we suspect that some of the threads create a few thousand connections to the DB. Our jgroups-tcp.xml is below:

<TCP loopback="true"
     recv_buf_size="${tcp.recv_buf_size:20M}"
     send_buf_size="${tcp.send_buf_size:640K}"
     discard_incompatible_packets="true"
     max_bundle_size="64K"
     max_bundle_timeout="30"
     enable_bundling="true"
     use_send_queues="true"
     sock_conn_timeout="300"
     timer_type="new"
     timer.min_threads="4"
     timer.max_threads="10"
     timer.keep_alive_time="3000"
     timer.queue_max_size="500"
     thread_pool.enabled="true"
     thread_pool.min_threads="40"
     thread_pool.max_threads="250"
     thread_pool.keep_alive_time="5000"
     thread_pool.queue_enabled="false"
     thread_pool.queue_max_size="10000"
     thread_pool.rejection_policy="discard"
     oob_thread_pool.enabled="true"
     oob_thread_pool.min_threads="5"
     oob_thread_pool.max_threads="40"
     oob_thread_pool.keep_alive_time="5000"
     oob_thread_pool.queue_enabled="false"
     oob_thread_pool.queue_max_size="10000"
     oob_thread_pool.rejection_policy="discard"
     bind_addr="${hybris.jgroups.bind_addr}"
     bind_port="${hybris.jgroups.bind_port}" />
<JDBC_PING connection_driver="${hybris.database.driver}"
           connection_password="${hybris.database.password}"
           connection_username="${hybris.database.user}"
           connection_url="${hybris.database.url}"
           initialize_sql="${hybris.jgroups.schema}"
           datasource_jndi_name="${hybris.datasource.jndi.name}" />
<MERGE2 min_interval="10000" max_interval="30000" />
<FD_SOCK />
<FD timeout="3000" max_tries="3" />
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK use_mcast_xmit="false" exponential_backoff="500" discard_delivered_msgs="true" />
<UNICAST />
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M" />
<pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" />
<UFC max_credits="20M" min_threshold="0.6" />
<MFC max_credits="20M" min_threshold="0.4" />
<FRAG2 frag_size="60K" />
<pbcast.STATE_TRANSFER />
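One thing we are unsure about in this stack: the JDBC_PING element sets both the raw connection_* attributes and datasource_jndi_name. As far as we understand the JGroups 3.x JDBC_PING protocol, when only connection_driver/connection_url are used, it opens a fresh DriverManager connection for each discovery operation, whereas datasource_jndi_name makes it draw connections from the pooled DataSource looked up via JNDI. A variant we are considering (a sketch only, keeping our existing placeholders; we have not verified which attribute wins when both are set):

```xml
<!-- Sketch: JDBC_PING backed only by a pooled DataSource, with the raw
     connection_* attributes removed so that discovery reuses pooled
     connections instead of opening a new one per operation. -->
<JDBC_PING datasource_jndi_name="${hybris.datasource.jndi.name}"
           initialize_sql="${hybris.jgroups.schema}" />
```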

We have also attached a thread dump from one of the affected EC2 instances.

We also captured some JGroups statistics using the Probe tool (JMX stats for MFC and NAKACK), although we do not fully understand what they tell us. NAKACK statistics:

1 (1959 bytes): local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e] cluster=tu-broadcast view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233] physical_addr=172.31.82.243:7800

jmx=NAKACK={num_messages_received=4725458, use_mcast_xmit=false, current_seqno=271042, ergonomics=true, xmit_table_max_compaction_time=600000, become_server_queue_size=50, non_member_messages=0, xmit_stagger_timeout=200, print_stability_history_on_failed_xmit=false, discard_delivered_msgs=true, suppress_time_non_member_warnings=60000, xmit_rsps_sent=2161, xmit_table_num_rows=5, stats=true, xmit_from_random_member=false, size_of_all_messages=1152, log_not_found_msgs=true, xmit_table_resize_factor=1.2, xmit_table_missing_messages=0, id=15, max_rebroadcast_timeout=2000, msgs=hybrisnode-82243:

hybrisnode-83113: [2894377 (2894377)] hybrisnode-0: [47023917 (47023917)] hybrisnode-83219: [579595 (579595)] hybrisnode-83245: [251862 (251862)] hybrisnode-82237: [2827706 (2827706)] hybrisnode-82243: [271042 (271042) (size=17, missing=0, highest stability=271025)] hybrisnode-8330: [7676 (7676)] hybrisnode-82248: [733 (733)] hybrisnode-83164: [596664 (596664)] hybrisnode-82191: [263111 (263111)] hybrisnode-82178: [760904 (760904)] hybrisnode-83218: [947031 (947031)] hybrisnode-83233: [206 (206)] hybrisnode-83226: [368099 (368099)]

, pending_xmit_requests=0, xmit_table_size=17, xmit_table_msgs_per_row=10000, retransmit_timeout=[I@67cdc0ee, size_of_all_messages_incl_headers=2053, max_msg_batch_size=100, exponential_backoff=500, num_messages_sent=271033, log_discard_msgs=true, xmit_reqs_received=2161, name=NAKACK, xmit_rsps_received=1413, use_mcast_xmit_req=false, xmit_reqs_sent=1413, use_range_based_retransmitter=true}

version=3.4.1.Final

MFC statistics: -- sending probe on /ff0e:0:0:0:0:0:75:75:7500

1 (1208 bytes):

local_addr=hybrisnode-82243 [5fa277bc-8216-d941-7c67-0792b84f671e] cluster=tu-broadcast view=[hybrisnode-83113|2606] (14) [hybrisnode-83113, hybrisnode-0, hybrisnode-82178, hybrisnode-83218, hybrisnode-82237, hybrisnode-83219, hybrisnode-83245, hybrisnode-83226, hybrisnode-82191, hybrisnode-82243, hybrisnode-8330, hybrisnode-83164, hybrisnode-82248, hybrisnode-83233] physical_addr=172.31.82.243:7800

jmx=MFC={ergonomics=true, average_time_blocked=0.0, number_of_credit_responses_sent=100, min_credits=8000000, ignore_synchronous_response=false, number_of_credit_requests_received=1, min_threshold=0.4, total_time_blocked=0, stats=true, receivers=hybrisnode-83113: 15207925

hybrisnode-0: 16166436 hybrisnode-83219: 16151370 hybrisnode-83245: 14427593 hybrisnode-82237: 19821107 hybrisnode-82243: 18221987 hybrisnode-8330: 17618119 hybrisnode-82248: 19950042 hybrisnode-83164: 11538170 hybrisnode-82191: 8629123 hybrisnode-82178: 17534244 hybrisnode-83218: 17558613 hybrisnode-83233: 19985889 hybrisnode-83226: 18462899 , max_credits=20000000, name=MFC, number_of_credit_requests_sent=0, number_of_credit_responses_received=106, id=44, number_of_blockings=0, max_block_time=5000}

version=3.4.1.Final

Our questions, for anyone who can support us here, are:
- What could lead to such an increase in DB connections at the time we scale up customer-facing EC2 instances from, say, 5 to 10 nodes? (Please also bear in mind that the cluster contains other instances as well - instances for batch processing, and instances for employees managing cockpits and customer support.)
- If JGroups is somehow causing the increase in DB connections, how can we improve the configuration above?
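If JDBC_PING is indeed the suspect, one check we are considering during the next scale-up is the discovery table itself, since stale rows from terminated EC2 instances would inflate the discovery work done on every join. A sketch, assuming the default JGROUPSPING table that JDBC_PING's default initialize_sql creates (our ${hybris.jgroups.schema} may define a different name):

```sql
-- Count registered members per cluster; a count much larger than the
-- number of live nodes would point to stale rows from dead instances.
SELECT cluster_name, COUNT(*) AS members
FROM JGROUPSPING
GROUP BY cluster_name;
```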

Thank you kindly,
Simeon
