This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Troubleshooting LME install

Troubleshooting overview

Figure 1: Troubleshooting overview diagram

| Diagram Ref | Protocol information | Process information | Log file location | Common issues |
|---|---|---|---|---|
| a | Outbound WinRM using TCP 5985. Link is HTTP, underlying data is authenticated and encrypted with Kerberos. See this Microsoft article for more information. | On the Windows client, press Windows key + R. Then type 'services.msc' to access services on this machine. You should have 'Windows Remote Management (WS-Management)' and 'Windows Event Log'. Both of these should be set to automatically start and be running. WinRM is started via the GPO that is applied to clients. | Open Event Viewer on the Windows client. Expand 'Applications and Services Log' -> 'Microsoft' -> 'Windows' -> 'Eventlog-ForwardingPlugin' -> 'Operational'. | "The WinRM client cannot process the request because the server name cannot be resolved." This is due to network issues (VPN not up, not on local LAN) between client and the Event Collector. |
| b | Inbound WinRM TCP 5985. | On the Windows Event Collector, press Windows key + R. Then type 'services.msc' to access services on this machine. You should have 'Windows Event Collector'. This should be set to automatic start and running. It is enabled with the GPO for the Windows Event Collector. | Open Event Viewer on the Windows Event Collector. Expand 'Applications and Services Log' -> 'Microsoft' -> 'Windows' -> 'EventCollector' -> 'Operational'. Also, in Event Viewer check the subscription is active and clients are sending in logs. Click on 'Subscriptions', then right click on 'lme' and 'Runtime Status'. This will show total and active computers connected. | Restarting the Windows Event Collector machine can sometimes get clients to connect. |
| c | Outbound TCP 5044. Lumberjack protocol using TLS mutual authentication. Certificates generated as part of the install, and downloaded as a ZIP from the Linux server. | On the Windows Event Collector, press Windows key + R. Then type 'services.msc' to access services on this machine. You should have 'winlogbeat'. It should be set to automatically start and be running. | %programdata%\winlogbeat\logs\winlogbeat | TBC |
| d | Inbound TCP 5044. Lumberjack protocol using TLS mutual authentication. Certificates generated as part of the install. | On the Linux server type 'sudo docker stack ps lme', and check that lme_logstash, lme_kibana and lme_elasticsearch all have a current status of running. | On the Linux server type: 'sudo docker service logs -f lme_logstash' | TBC |
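
When working through the table above, it can help to first confirm that the two TCP ports it mentions (5985 for WinRM and 5044 for the Lumberjack connection) are reachable at all. The following is a minimal Python sketch of such a check, not part of LME itself. The function names and the host names passed in are hypothetical, and a successful connection only shows that the port is open; it says nothing about Kerberos or TLS authentication succeeding.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution failures, refused connections and timeouts.
        return False


def check_lme_ports(collector_host: str, linux_host: str) -> dict:
    """Check the two ports from the troubleshooting table.

    Both host names are placeholders supplied by the caller:
    collector_host is the Windows Event Collector (WinRM, TCP 5985),
    linux_host is the Linux server (Lumberjack, TCP 5044).
    """
    return {
        "winrm_5985": port_is_open(collector_host, 5985),
        "lumberjack_5044": port_is_open(linux_host, 5044),
    }


# Example (hypothetical host names):
# print(check_lme_ports("wec.example.local", "lme-linux.example.local"))
```

Run the WinRM check from a client and the 5044 check from the Windows Event Collector, so that each test follows the same network path as the real traffic.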

Elastic Specific Troubleshooting

Elastic maintain a series of troubleshooting guides, which should be consulted as part of the standard investigation process if the issue you are experiencing lies in the Elastic stack components of LME.

These guides can be found here and cover a number of common issues which may be experienced.

Common Errors

Unhealthy Cluster Status

There are a number of reasons why the cluster's health may be yellow or red, but a common cause is unassigned replica shards. As LME is a single-node instance by default, replicas can never be assigned. The issue usually arises from built-in indices which do not have the index.auto_expand_replicas value correctly set. This will be fixed in a future release of Elastic, but in the meantime it can be diagnosed and resolved as follows:

Check the cluster health by running the following request against Elasticsearch (an easy way to do this is to navigate to Dev Tools in Kibana under Management on the left-hand menu):

GET _cluster/health?filter_path=status,*_shards

If it shows any unassigned shards, these can be enumerated with the following command:

GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state

If the UNASSIGNED shard is shown as r rather than p this means it's a replica. In this case the error can be safely fixed in the single-node default installation of LME by forcing all indices to have a replica count of 0 using the following request:

PUT _settings
{
  "index.number_of_replicas": 0
}
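
Before applying the setting above, it is worth confirming that every unassigned shard really is a replica. The sketch below illustrates that check in Python. It assumes the `_cat/shards` request was made with `format=json` added, so that the response is a list of objects keyed by the requested column names (`index`, `shard`, `prirep`, `state`). The function names are hypothetical.

```python
def unassigned_shards(rows):
    """Return the rows from _cat/shards whose state is UNASSIGNED."""
    return [row for row in rows if row.get("state") == "UNASSIGNED"]


def safe_to_drop_replicas(rows):
    """True only if there are unassigned shards and all of them are replicas.

    A 'p' (primary) among the unassigned shards points to a different
    problem, which setting the replica count to 0 will not fix.
    """
    unassigned = unassigned_shards(rows)
    return bool(unassigned) and all(row.get("prirep") == "r" for row in unassigned)


# Example with hand-written sample rows (not real output):
sample = [
    {"index": "winlogbeat-example", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "winlogbeat-example", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
]
print(safe_to_drop_replicas(sample))
```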

If the above solution does not resolve your issue, further information and general advice on troubleshooting an unhealthy cluster status can be found here.

Windows Log with Error Code #2150859027

If you are on Windows 2016 or higher and are getting error code 2150859027, or messages about HTTP URLs not being available in your Windows logs, we suggest looking at this guide.

No logs forwarded from Member Servers

Check the following:

  • Sysmon service is running on the client
  • The LME-WEC-Client-GPO is applying to the member server
  • The member server has been rebooted to apply permissions to the logs (see issue #41)

Events not forwarding from Domain Controllers

Please be aware that Logging Made Easy does not currently support logging Domain Controllers, and the log volumes may be significant from servers with this role. If you wish to proceed with forwarding logs from your Domain Controllers, please be aware you do this at your own risk! Monitoring such servers has not been tested or endorsed by the NCSC and may have unintended side effects.

Importing the Kibana dashboard hangs

Importing the dashboards manually is described in section 4.1.4. First, ensure you have modified the latest dashboards file from GitHub to replace ChangeThisDomain with your Kibana server’s DNS name. Note that it’s imperative that you keep the trailing backslash (e.g. https://kibanahostname.example.com\), otherwise importing the file to Kibana will hang. This is discussed more in issue #74.
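
One way to avoid accidentally removing characters around the placeholder is to replace only the ChangeThisDomain token itself and leave everything else in the file untouched. The following Python sketch shows that idea. The function name and the file name in the usage comment are hypothetical, and it assumes the placeholder appears in the file exactly as ChangeThisDomain.

```python
def replace_placeholder(text: str, dns_name: str,
                        placeholder: str = "ChangeThisDomain") -> str:
    """Replace only the placeholder token, leaving surrounding characters
    (including any trailing backslash) exactly as they were."""
    if placeholder not in text:
        raise ValueError(f"placeholder {placeholder!r} not found in input")
    return text.replace(placeholder, dns_name)


# Usage (hypothetical file name):
# with open("dashboards.ndjson", encoding="utf-8") as f:
#     original = f.read()
# with open("dashboards-modified.ndjson", "w", encoding="utf-8") as f:
#     f.write(replace_placeholder(original, "kibanahostname.example.com"))
```

Raising an error when the placeholder is missing guards against editing a file that has already been modified or is not the expected dashboards file.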

Events not forwarded to Kibana

The winlogbeat service installed in section 3.3 is responsible for sending events from the collector to Kibana. Confirm the winlogbeat service is running and check the log files in C:\ProgramData\winlogbeat\logs for errors.

By default the ForwardedEvents maximum log size is around 20MB so events will be lost if the winlogbeat service stops. Consider increasing the size of the ForwardedEvents log file to help reduce log loss in this scenario. Historical logs are sent once the winlogbeat service starts.

  • Open Microsoft Event Viewer (eventvwr)
  • Expand Windows Logs and right click Forwarded Events
  • Click Properties
  • Adjust Maximum log size (KB) to a higher value. Note that the system will automatically adjust the size to the nearest multiple of 64KB.

Figure: Adjusting the log size
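
Because the size is adjusted to a multiple of 64KB, it can be simpler to enter a value that is already a multiple of 64 so that the value you type is the value that is stored. The small Python helper below shows the arithmetic. The function name is hypothetical, and it rounds up, which is a choice made here for illustration rather than a description of how Windows itself rounds.

```python
def round_up_to_64kb(size_kb: int) -> int:
    """Round a size in KB up to the next multiple of 64 KB."""
    if size_kb <= 0:
        raise ValueError("size_kb must be positive")
    # Ceiling division without floating point: -(-a // b) == ceil(a / b).
    return -(-size_kb // 64) * 64


# 100000 KB is not a multiple of 64, so the next valid value is used instead.
print(round_up_to_64kb(100000))   # 100032
# 1 GB expressed in KB is already a multiple of 64 and is left unchanged.
print(round_up_to_64kb(1048576))  # 1048576
```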

Kibana Discover View Showing Wrong Index

If the Discover section of Kibana is persistently showing the wrong index by default it is worth checking that the winlogbeat index pattern is still set as the default within Kibana. This can be done using the steps below:

Select "Stack Management" from the left hand menu:

Figure: Check Default Index

Select "Index Patterns" under Kibana Stack Management:

Figure: Check Default Index

Verify that the "Default" label is set next to the winlogbeat-* Index pattern:

Figure: Check Default Index

If this Index pattern is not selected as the default, this can be re-done by clicking on the winlogbeat-* pattern and then selecting the following option in the subsequent page:

Figure: Set Default Index

Re-Indexing Errors

For errors encountered when re-indexing existing data as part of an LME version upgrade, please review the Elastic re-indexing documentation for help, available here.

Illegal Argument Exception Whilst Re-Indexing

With the correct mapping in place it is not possible to store a string value in any of the fields which represent IP addresses, for example source.ip or destination.ip. If any of these values are represented in your current data as strings, such as LOCAL, it will not be possible to successfully re-index with the correct mapping. In this instance the simplest fix is to modify your existing data to store the relevant fields as valid IP representations using the update_by_query method, documented here.

An example of this is shown below, which may need to be modified for the particular field that is causing problems:

POST winlogbeat-11.06.2021/_update_by_query
{
  "script": {
    "source": "ctx._source.source.ip = '127.0.0.1'",
    "lang": "painless"
  },
  "query": {
    "match": {
      "source.ip": "LOCAL"
    }
  }
}

Note that this will need to be run for each index that contains problematic data before re-indexing can be completed.
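
Since the same request has to be repeated per index and possibly per field, it may be convenient to generate the requests rather than edit them by hand. The Python sketch below builds the request body for a given field and prints one console request per index, ready to paste into Dev Tools. The function names and the index names in the example are hypothetical. Unlike the example above, it passes the replacement value through script `params` instead of embedding it in the script source.

```python
import ipaddress
import json
import re

# Only plain dotted field names (e.g. source.ip) are accepted, because the
# field name is placed directly into the Painless script source.
_FIELD_RE = re.compile(r"^[A-Za-z_]\w*(\.[A-Za-z_]\w*)*$")


def build_ip_fix_body(field: str, bad_value: str,
                      replacement: str = "127.0.0.1") -> dict:
    """Build an _update_by_query body that overwrites bad_value in field."""
    if not _FIELD_RE.match(field):
        raise ValueError(f"unexpected field name: {field!r}")
    ipaddress.ip_address(replacement)  # raises ValueError if not a valid IP
    return {
        "script": {
            "source": f"ctx._source.{field} = params.replacement",
            "lang": "painless",
            "params": {"replacement": replacement},
        },
        "query": {"match": {field: bad_value}},
    }


def console_requests(indices, field, bad_value, replacement="127.0.0.1"):
    """Return one Dev Tools console request string per index."""
    body = json.dumps(build_ip_fix_body(field, bad_value, replacement), indent=2)
    return [f"POST {index}/_update_by_query\n{body}" for index in indices]


# Example with made-up index names:
for request in console_requests(["winlogbeat-example-1", "winlogbeat-example-2"],
                                "destination.ip", "LOCAL"):
    print(request, end="\n\n")
```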

TLS Certificates Expired

For security, the self-signed certificates generated for use by LME at install time only remain valid for two years, and LME will stop functioning once they expire. In this case the certificates can be recreated by following the instructions detailed here.
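
To see how long a certificate has left before it expires, the expiry date can be read with `openssl x509 -enddate -noout -in <certificate file>`, which prints a line such as `notAfter=Jun  1 12:00:00 2025 GMT`. The Python sketch below turns that line into a number of days remaining. The function name is hypothetical and the location of the certificate file is deliberately left to the reader, as it depends on your installation.

```python
import ssl
import time


def days_until_expiry(enddate_line: str, now=None) -> float:
    """Return days remaining until the date in an openssl 'notAfter=' line.

    A negative result means the certificate has already expired.
    `now` is seconds since the epoch and defaults to the current time.
    """
    not_after = enddate_line.strip()
    prefix = "notAfter="
    if not_after.startswith(prefix):
        not_after = not_after[len(prefix):]
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" format.
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400.0


# Example with a made-up date:
# print(days_until_expiry("notAfter=Jun  1 12:00:00 2025 GMT"))
```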

Dashboard Update Script Failing

If you encounter an error when the dashboards are updated using the dashboard update script, either manually or as part of automatic updates, this may mean that your current version of Elastic is too old to support the minimum functionality required for the new dashboard versions. Ensure that the latest supported version of the Elastic stack is in use with the following command:

cd /opt/lme/Chapter\ 3\ Files/
sudo ./deploy.sh update

Then upload the latest dashboards by following one of the methods described here.