ElasticSearch Server Randomly Stops Working

March 10

I have 2 ES servers that are fed by 1 logstash server, and I view the logs in Kibana. This is a POC to work out any issues before going into production. The system has run for ~1 month, and every few days Kibana stops showing logs at some random time in the middle of the night. Last night, the last log entry I received in Kibana was around 18:30. When I checked the ES servers, /sbin/service elasticsearch status showed the master running and the secondary not running, but I was able to curl localhost and it returned information, so I'm not sure what's up with that. Anyway, when I check the cluster health on the master node, I get this:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "gis-elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 186,
  "active_shards" : 194,
  "relocating_shards" : 0,
  "initializing_shards" : 7,
  "unassigned_shards" : 249
}
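To make the health output above easier to read at a glance, here is a minimal Python sketch that turns those counters into a percentage of allocated shards. The function name `summarize_health` is made up for illustration, and the sketch assumes that relocating shards are already counted inside `active_shards`, so the total is taken as active + initializing + unassigned.

```python
def summarize_health(health):
    """Summarize shard allocation from a _cluster/health response (a dict)."""
    active = health["active_shards"]
    initializing = health["initializing_shards"]
    unassigned = health["unassigned_shards"]
    # Assumption: relocating shards are already included in active_shards.
    total = active + initializing + unassigned
    percent_active = round(100.0 * active / total, 1) if total else 100.0
    return {
        "status": health["status"],
        "total_shards": total,
        "percent_active": percent_active,
        # Anything other than green, or any unassigned shard, deserves a look.
        "needs_attention": health["status"] != "green" or unassigned > 0,
    }


# Values copied from the health output in the question.
reported = {
    "cluster_name": "gis-elasticsearch",
    "status": "red",
    "timed_out": False,
    "number_of_nodes": 6,
    "number_of_data_nodes": 2,
    "active_primary_shards": 186,
    "active_shards": 194,
    "relocating_shards": 0,
    "initializing_shards": 7,
    "unassigned_shards": 249,
}

summary = summarize_health(reported)
print(summary)
```

With the numbers from the question, 194 of 450 shards are active, which is roughly 43%, so the cluster was still far from recovered when the health check was run.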

When I view the indexes via "ls ...nodes/0/indices/", it shows all indexes as modified today for some reason, and there are new files with today's date. So I think I'm starting to catch back up after restarting both servers, but I'm not sure why it failed in the first place. When I look at the logs on the master, I only see 4 warnings at 18:57 and then the secondary leaving the cluster. I don't see anything in the logs on the secondary (Pistol) about why it stopped working or what truly happened.

[2014-03-06 18:57:04,121][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147630]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147717]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147718]
[2014-03-06 18:57:04,124][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147721]

[2014-03-06 19:56:08,467][INFO ][cluster.service ] [ElasticSearch Server1] removed {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-node_failed([Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2014-03-06 19:56:12,304][INFO ][cluster.service ] [ElasticSearch Server1] added {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}])
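When reading log excerpts like the ones above, it can help to pull out the timestamps and measure how long a node was gone. The following is a rough Python sketch, not an official tool: the regular expression is an assumption based only on the line layout visible in this post, the helper names `parse_line` and `rejoin_gaps` are hypothetical, and the sample messages are abbreviated with "..." from the excerpt.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the excerpt: [timestamp][LEVEL][logger] [node] message
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[^\]]+?)\s*\]\s*"
    r"\[(?P<node>[^\]]+)\]\s*(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "level": match.group("level"),
        "logger": match.group("logger"),
        "node": match.group("node"),
        "message": match.group("message"),
    }


def rejoin_gaps(lines):
    """Seconds between each 'removed' event and the next 'added' event.

    Simplification: events are not matched per node, so this only makes
    sense when a single node is dropping out, as in the excerpt.
    """
    removed_at = None
    gaps = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["logger"] != "cluster.service":
            continue
        if entry["message"].startswith("removed "):
            removed_at = entry["timestamp"]
        elif entry["message"].startswith("added ") and removed_at is not None:
            gaps.append((entry["timestamp"] - removed_at).total_seconds())
            removed_at = None
    return gaps


# Lines abbreviated from the master's log in the question.
excerpt = [
    "[2014-03-06 18:57:04,121][WARN ][transport                ] [ElasticSearch Server1] Transport response handler not found of id [64147630]",
    "[2014-03-06 19:56:08,467][INFO ][cluster.service ] [ElasticSearch Server1] removed {[Pistol]...}, reason: zen-disco-node_failed(...)",
    "[2014-03-06 19:56:12,304][INFO ][cluster.service ] [ElasticSearch Server1] added {[Pistol]...}, reason: zen-disco-receive(...)",
]

gaps = rejoin_gaps(excerpt)
print(gaps)
```

For the two cluster.service entries in the post, the node is added back less than four seconds after it was removed, which suggests it was unresponsive for a while rather than shut down.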

Any ideas on additional logging or troubleshooting I can turn on to keep this from happening in the future? Since the shards are not caught up, right now I'm just seeing a lot of debug messages about "failed to parse". I'm assuming that will be corrected once we catch up.

[2014-03-07 10:06:52,235][DEBUG][action.search.type ] [ElasticSearch Server1] All shards failed for phase: [query]
[2014-03-07 10:06:52,223][DEBUG][action.search.type ] [ElasticSearch Server1] [windows-2014.03.07][3], node[W6aEFbimR5G712ddG_G5yQ], [P], s[STARTED]: Failed to execute [[email protected]] lastShard [true]
org.elasticsearch.search.SearchParseException: [windows-2014.03.07][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"10m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"(ASA AND Deny)"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1394118412373,"to":"now"}}}]}}}}}}}},"size":0}]]

Answers

The usual suspects for ES with Kibana are:

  • Too little memory available for ES (which you can investigate with any monitoring tool such as Marvel, or anything that ships JVM data off the machine for monitoring)
  • Long GC durations (turn on GC logging and check whether long pauses line up with the times ES stops responding)
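As one way to act on the GC suggestion: on Java 7/8-era HotSpot JVMs, flags such as -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintGCApplicationStoppedTime and -Xloggc:<path> write a GC log that includes "Total time for which application threads were stopped" lines (newer JVMs use a different logging syntax). The Python sketch below scans such a log for long pauses. The helper name `long_pauses`, the threshold, and the sample lines are all made up for illustration; none of them come from the poster's servers.

```python
import re

# Line emitted when -XX:+PrintGCApplicationStoppedTime is enabled (Java 7/8 HotSpot).
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_seconds=1.0):
    """Return (line_number, pause_seconds) for every pause at or above the threshold."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match is None:
            continue
        seconds = float(match.group(1))
        if seconds >= threshold_seconds:
            hits.append((number, seconds))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    "2000-01-01T00:00:01.000+0000: 1.000: Total time for which application threads were stopped: 0.0123450 seconds",
    "2000-01-01T00:01:00.000+0000: 60.000: Total time for which application threads were stopped: 42.5000000 seconds",
    "2000-01-01T00:01:05.000+0000: 65.000: some unrelated GC detail line",
]

pauses = long_pauses(sample, threshold_seconds=5.0)
print(pauses)
```

If pauses near or above the 30s ping timeout shown in the master's log turn up around the time Kibana stops receiving data, GC is a strong candidate for the outage.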

Also the "usual" setup for ES is 3 servers to allow better redundancy when one server is down. But YMMV.
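One common way to see why 3 servers give more headroom than 2 is majority arithmetic: if the cluster needs a majority of master-eligible nodes to keep a master elected (in ES of this era that is what the discovery.zen.minimum_master_nodes setting is typically set to), then 2 nodes tolerate no failures while 3 tolerate one. This is only a sketch of that arithmetic, under the assumption that every server is master-eligible; the function names are hypothetical.

```python
def majority(master_eligible_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many nodes can be lost while a majority is still reachable."""
    return master_eligible_nodes - majority(master_eligible_nodes)


for nodes in (2, 3):
    print(nodes, "nodes: majority", majority(nodes),
          "tolerated failures", tolerated_failures(nodes))
```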

You can also try the newer G1 garbage collector, which (in my case) behaves much better than CMS on my Kibana ES setup.

The GC duration problem is usually the one that strikes when you're looking somewhere else, and it typically leads to a loss of data because ES stops responding.

Good luck with these :)
