A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from http://tech.marksblogg.com/detect-bots-apache-nginx-logs.html below:

Detecting Bots in Apache & Nginx Logs

The following walks through the contents of monitor.py. This script accepts web traffic logs piped in from stdin. This means you can tail a log on a remote server via ssh and run this script locally.

I'll first import two libraries from the Python Standard Library and the five external libraries installed via pip.

import sys
from urlparse import urlparse

import apache_log_parser
from colorama import Back, Style
import geoip2.database
from netaddr import IPNetwork, IPAddress
from user_agents import parse

In the following I've setup MaxMind's geoip2 library to use the "GeoLite2-City.mmdb" city-level library.

I've also setup apache_log_parser to work with the format my web logs are being stored in. Your log format may vary so please take the time to compare your web server's traffic logging configuration against the library's format documentation.

Finally, I have a dictionary of the CIDR blocks I found being owned by the seven firms. Included in this list is Baidu, which isn't a hosting or cloud provider per-se but nonetheless run bots that haven't always identified themselves by their user agent.

reader = geoip2.database.Reader('GeoLite2-City.mmdb')

_format = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
line_parser = apache_log_parser.make_parser(_format)

CIDRS = {
    'Amazon': ['107.20.0.0/14', '122.248.192.0/19', '122.248.224.0/19',
               '172.96.96.0/20', '174.129.0.0/16', '175.41.128.0/19',
               '175.41.160.0/19', '175.41.192.0/19', '175.41.224.0/19',
               '176.32.120.0/22', '176.32.72.0/21', '176.34.0.0/16',
               '176.34.144.0/21', '176.34.224.0/21', '184.169.128.0/17',
               '184.72.0.0/15', '185.48.120.0/26', '207.171.160.0/19',
               '213.71.132.192/28', '216.182.224.0/20', '23.20.0.0/14',
               '46.137.0.0/17', '46.137.128.0/18', '46.51.128.0/18',
               '46.51.192.0/20', '50.112.0.0/16', '50.16.0.0/14', '52.0.0.0/11',
               '52.192.0.0/11', '52.192.0.0/15', '52.196.0.0/14',
               '52.208.0.0/13', '52.220.0.0/15', '52.28.0.0/16', '52.32.0.0/11',
               '52.48.0.0/14', '52.64.0.0/12', '52.67.0.0/16', '52.68.0.0/15',
               '52.79.0.0/16', '52.80.0.0/14', '52.84.0.0/14', '52.88.0.0/13',
               '54.144.0.0/12', '54.160.0.0/12', '54.176.0.0/12',
               '54.184.0.0/14', '54.188.0.0/14', '54.192.0.0/16',
               '54.193.0.0/16', '54.194.0.0/15', '54.196.0.0/15',
               '54.198.0.0/16', '54.199.0.0/16', '54.200.0.0/14',
               '54.204.0.0/15', '54.206.0.0/16', '54.207.0.0/16',
               '54.208.0.0/15', '54.210.0.0/15', '54.212.0.0/15',
               '54.214.0.0/16', '54.215.0.0/16', '54.216.0.0/15',
               '54.218.0.0/16', '54.219.0.0/16', '54.220.0.0/16',
               '54.221.0.0/16', '54.224.0.0/12', '54.228.0.0/15',
               '54.230.0.0/15', '54.232.0.0/16', '54.234.0.0/15',
               '54.236.0.0/15', '54.238.0.0/16', '54.239.0.0/17',
               '54.240.0.0/12', '54.242.0.0/15', '54.244.0.0/16',
               '54.245.0.0/16', '54.247.0.0/16', '54.248.0.0/15',
               '54.250.0.0/16', '54.251.0.0/16', '54.252.0.0/16',
               '54.253.0.0/16', '54.254.0.0/16', '54.255.0.0/16',
               '54.64.0.0/13', '54.72.0.0/13', '54.80.0.0/12', '54.72.0.0/15',
               '54.79.0.0/16', '54.88.0.0/16', '54.93.0.0/16', '54.94.0.0/16',
               '63.173.96.0/24', '72.21.192.0/19', '75.101.128.0/17',
               '79.125.64.0/18', '96.127.0.0/17'],
    'Baidu': ['180.76.0.0/16', '119.63.192.0/21', '106.12.0.0/15',
              '182.61.0.0/16'],
    'DO': ['104.131.0.0/16', '104.236.0.0/16', '107.170.0.0/16',
           '128.199.0.0/16', '138.197.0.0/16', '138.68.0.0/16',
           '139.59.0.0/16', '146.185.128.0/21', '159.203.0.0/16',
           '162.243.0.0/16', '178.62.0.0/17', '178.62.128.0/17',
           '188.166.0.0/16', '188.166.0.0/17', '188.226.128.0/18',
           '188.226.192.0/18', '45.55.0.0/16', '46.101.0.0/17',
           '46.101.128.0/17', '82.196.8.0/21', '95.85.0.0/21', '95.85.32.0/21'],
    'Dream': ['173.236.128.0/17', '205.196.208.0/20', '208.113.128.0/17',
              '208.97.128.0/18', '67.205.0.0/18'],
    'Google': ['104.154.0.0/15', '104.196.0.0/14', '107.167.160.0/19',
               '107.178.192.0/18', '108.170.192.0/20', '108.170.208.0/21',
               '108.170.216.0/22', '108.170.220.0/23', '108.170.222.0/24',
               '108.59.80.0/20', '130.211.128.0/17', '130.211.16.0/20',
               '130.211.32.0/19', '130.211.4.0/22', '130.211.64.0/18',
               '130.211.8.0/21', '146.148.16.0/20', '146.148.2.0/23',
               '146.148.32.0/19', '146.148.4.0/22', '146.148.64.0/18',
               '146.148.8.0/21', '162.216.148.0/22', '162.222.176.0/21',
               '173.255.112.0/20', '192.158.28.0/22', '199.192.112.0/22',
               '199.223.232.0/22', '199.223.236.0/23', '208.68.108.0/23',
               '23.236.48.0/20', '23.251.128.0/19', '35.184.0.0/14',
               '35.188.0.0/15', '35.190.0.0/17', '35.190.128.0/18',
               '35.190.192.0/19', '35.190.224.0/20', '8.34.208.0/20',
               '8.35.192.0/21', '8.35.200.0/23',],
    'Hetzner': ['129.232.128.0/17', '129.232.156.128/28', '136.243.0.0/16',
                '138.201.0.0/16', '144.76.0.0/16', '148.251.0.0/16',
                '176.9.12.192/28', '176.9.168.0/29', '176.9.24.0/27',
                '176.9.72.128/27', '178.63.0.0/16', '178.63.120.64/27',
                '178.63.156.0/28', '178.63.216.0/29', '178.63.216.128/29',
                '178.63.48.0/26', '188.40.0.0/16', '188.40.108.64/26',
                '188.40.132.128/26', '188.40.144.0/24', '188.40.48.0/26',
                '188.40.48.128/26', '188.40.72.0/26', '196.40.108.64/29',
                '213.133.96.0/20', '213.239.192.0/18', '41.203.0.128/27',
                '41.72.144.192/29', '46.4.0.128/28', '46.4.192.192/29',
                '46.4.84.128/27', '46.4.84.64/27', '5.9.144.0/27',
                '5.9.192.128/27', '5.9.240.192/27', '5.9.252.64/28',
                '78.46.0.0/15', '78.46.24.192/29', '78.46.64.0/19',
                '85.10.192.0/20', '85.10.228.128/29', '88.198.0.0/16',
                '88.198.0.0/20'],
    'Linode': ['104.200.16.0/20', '109.237.24.0/22', '139.162.0.0/16',
               '172.104.0.0/15', '173.255.192.0/18', '178.79.128.0/21',
               '198.58.96.0/19', '23.92.16.0/20', '45.33.0.0/17',
               '45.56.64.0/18', '45.79.0.0/16', '50.116.0.0/18',
               '80.85.84.0/23', '96.126.96.0/19'],
}

I've created a utility function where I can pass in an IPv4 address and a list of CIDR blocks and it'll tell me if the IP address belongs to any of the given CIDR blocks.

def in_block(ip, block):
    _ip = IPAddress(ip)
    return any([True
                for cidr in block
                if _ip in IPNetwork(cidr)])

The following function takes objects of a request and the browser agent used and tries to determine if the source of traffic and/or the browser agent are that of a bot. The browser agent object is put together by the Python User Agents library and it already has some tests for determining if a user agent string is that of a known bot. I've further expanded these tests with a number of tokens I saw slip through the library's classification system. I also iterate through the CIDR blocks to see if the the remote host's IPv4 address is within any of them.

def bot_test(req, agent):
    ua_tokens = ['daum/', # Daum Communications Corp.
                 'gigablastopensource',
                 'go-http-client',
                 'http://',
                 'httpclient',
                 'https://',
                 'libwww-perl',
                 'phantomjs',
                 'proxy',
                 'python',
                 'sitesucker',
                 'wada.vn',
                 'webindex',
                 'wget']

    is_bot = agent.is_bot or \
             any([True
                  for cidr in CIDRS.values()
                  if in_block(req['remote_host'], cidr)]) or \
             any([True
                  for token in ua_tokens
                  if token in agent.ua_string.lower()])

    return is_bot

The following is the main section of the script. Here web traffic logs are read in from stdin line by line. Each line of content is parsed for a tokenised version of the request, user agent and URI being requested. These objects make it easier to work with the data without the complexity of having to parse them on the fly.

I attempt to look up the city and country associated with the IPv4 address using MaxMind's library. If there is any sort of failure these are simply set to None.

After the bot test I prepare the output. If the request is seen to be that of a bot it'll highlight the output with a red background.

if __name__ == '__main__':
    while True:
        try:
            line = sys.stdin.readline()
        except KeyboardInterrupt:
            break

        if not line:
            break

        req = line_parser(line)
        agent = parse(req['request_header_user_agent'])
        uri = urlparse(req['request_url'])

        try:
            response = reader.city(req['remote_host'])
            country, city = response.country.iso_code, response.city.name
        except:
            country, city = None, None

        is_bot = bot_test(req, agent)

        agent_str = ', '.join([item
                               for item in agent.browser[0:3] +
                                           agent.device[0:3] +
                                           agent.os[0:3]
                               if item is not None and
                                  type(item) is not tuple and
                                  len(item.strip()) and
                                  item != 'Other'])

        ip_owner_str = ' '.join([network + ' IP'
                                  for network, cidr in CIDRS.iteritems()
                                  if in_block(req['remote_host'], cidr)])

        print Back.RED + 'b' if is_bot else 'h', \
              country, \
              city, \
              uri.path, \
              agent_str, \
              ip_owner_str, \
              Style.RESET_ALL

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4