Recent Posts

    Authors

    Published

    Tag Cloud

    301 302 404 accessibility accounts ACLs advertising aggregation Agile Analytics android APP Article attachments awards backup BCM beta browser business continuity Calendar case-study categories Chrome citigroup cms codes coding standards Complaints contact management software control panel crm CSS customer management software customer relationship system customize database DataModel DDoS demo design designer device compatibility difference distribute a published article via email DND DNS documents drag & drop Editor email EOL erp event Excel featured feeds file manager file sharing file volume Firefox Firewall HA hack Handlebar how-to HTML HTML5 HTTP HTTPS iCal IE Instructions intranet iOS iPad Java JavaScript JDBC JDK Jenkins Job Track Journal JSON JVM landing-page launcher layered database layout logging login mac marketing menu meta Microsoft Outlook mobile module modules mustache navigation NTLM offline page painter password passwords PCI policy poll pricing privacy PROXY publish publisher publsher PWA redirect Redundancy release release-notes Report Tool Reports Responsive ReST RESTFul Rich text RSS Safari sandbox sanity schedule scrum search security SEO sessions setup shipping site builder source spell SQL Injection SSL SSO standards store stSoftware support survey Swagger Task template testimonial Threads timezone tinyMCE Transaction Search trigger twitter twitter bootstrap Ubuntu unit tests unsubscribe URL validation WC3 AAA web folders web services webdav windows 8 wizard workflow WYSIWYG XLS XLST XML XPath XSS

    How are server issues (faults etc) monitored?

    Automated heartbeat monitoring and health checks.

    We have automated heart beat monitors which checks the health of the servers and a number of background tasks. In addition to checking the list of known tasks database read & write actions are performed and the available disk space is checked. If the heart beat monitor itself takes more than 5 minutes to run an alert is sent. The heart beat monitor runs every 15 minutes.

    Heart beat message

    The list of tasks that the system will monitor are defined in the class DBTask.

    List of tasks

    A task is defined by:-

    • code which uniquely identifies this task
    • The human readable label. 
    • Type of task
      • One per system
      • One per application/server
      • One per database layer.
    • If this task is enabled.
    • HA (high availability) specifies that two or more machines will run the same task, if the the primary version of this task fails to update the heartbeat within the heart beat duration, the standby task runner will start automatically. If the first machine running the primary task responds after the standby task has started the primary will become the standby task.
    • Heart beat duration is the time that a task must complete in, if the task takes longer than this period it's declared dead ( standby runner will start if in HA mode)
    • Process duration is how long one run of this task is expected to task. 
    • Notification period is how frequently to notify that this process is dead or has an error.

    Task screen

    Each server has it's own set of logs including:- 

    All the logging is configured via log4j

    Email alerts for serious errors are sent automatically via a Log4J delayed SMTP append-er which will batch errors into one email if more than one per minute was to be generated.