How IP to Country Lookup Works Under the Hood

Parsing RIR Exchange Format using Python

Posted 2019-05-12 13:19:01 by Ronie Martinez

Whenever you visit a website, you may be surprised that the site knows which country you are in. How? Every time you access the internet, the web servers you connect to see your IP address. IP (Internet Protocol) addresses provide a way to find devices on the internet, much like a home address. When you send mail from A to B, you provide the addresses of both the origin and the destination, so the receiver knows where the mail came from and where to send replies.

How are IP addresses linked to a country? Each country has specific blocks of IP addresses registered to it. Allocation and registration of IP addresses is managed by organizations called Regional Internet Registries (RIRs). Currently, there are 5 RIRs managing 5 regions around the world.

  • African Network Information Center (AFRINIC) for Africa
  • American Registry for Internet Numbers (ARIN) for Antarctica, Canada, parts of the Caribbean, and the United States
  • Asia-Pacific Network Information Centre (APNIC) for East Asia, Oceania, South Asia, and Southeast Asia
  • Latin America and Caribbean Network Information Centre (LACNIC) for most of the Caribbean and all of Latin America
  • Réseaux IP Européens Network Coordination Centre (RIPE NCC) for Europe, Central Asia, Russia, and West Asia

RIR Statistics Exchange Format

These RIRs publish lists of the IP addresses they have delegated, and you can find them on their respective FTP servers. The lists are stored in a delimiter-separated values format that follows a standard called the RIR Statistics Exchange Format.

Downloading RIR files

These FTP servers contain many files associated with IP address allocation. The file that contains all the data is named delegated-<rir>-extended-latest, where <rir> is one of the 5 RIRs (afrinic, arin, apnic, lacnic, ripencc). In addition, the integrity of each file can be verified by calculating its MD5 hash and comparing it to the contents of a companion file named delegated-<rir>-extended-latest.md5.

The first step is to download these files to a local directory. The following Python code downloads, updates and verifies them. It uses hashlib to calculate the MD5 hash of the local copy and urllib to fetch both the published hash and the file itself. Call update_rir_database() with the URL of each RIR's delegated-<rir>-extended-latest file.

#!/usr/bin/env python
import hashlib
import logging
import os
import sys

try:
    from urllib.request import urlopen, urlretrieve
except ImportError:
    from urllib import urlopen, urlretrieve

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging.getLogger(__name__)

DESTINATION = os.path.expanduser('~/.rir')


def update_rir_database(rir_database_url):
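    # Create the destination directory if it does not exist yet.
    # os.path.isdir() also works on Python 2, which has no FileExistsError.
    if not os.path.isdir(DESTINATION):
        os.makedirs(DESTINATION)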
    rir_database_path = os.path.join(DESTINATION, rir_database_url.split('/')[-1])
    try:
        if os.path.isfile(rir_database_path):
            hash_md5 = hashlib.md5()
            calculate_hash(hash_md5, rir_database_path)
            md5_text = urlopen(rir_database_url + '.md5').read().decode('utf-8')
            calculated_md5 = hash_md5.hexdigest()
            # The digest may sit at the start or at the end of the .md5 file,
            # so re-download only when the local digest appears nowhere in it.
            if calculated_md5 not in md5_text:
                logger.info("Updating RIR database: {}".format(rir_database_url))
                urlretrieve(rir_database_url, filename=rir_database_path)
                logger.info("RIR database updated: {}".format(rir_database_url))
            else:
                logger.info("RIR database is up-to-date: {}".format(rir_database_path))
        else:
            logger.info("Downloading RIR database {}".format(rir_database_path))
            urlretrieve(rir_database_url, filename=rir_database_path)
            logger.info("RIR database downloaded: {}".format(rir_database_url))
    except IOError as e:
        logger.exception(e)


def calculate_hash(hash_md5, path):
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hash_md5.update(chunk)
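
The digest comparison is the subtle part of update_rir_database(). Here is a minimal sketch of it in isolation, assuming the .md5 file carries the digest either at the start of the line or at the end, which is what the slicing in the original code suggests. The helper names extract_md5() and is_current() are hypothetical and not part of any library.

```python
import hashlib
import re


def extract_md5(md5_text):
    """Return the first 32-character hex digest found in md5_text, or None."""
    match = re.search(r'\b[0-9a-fA-F]{32}\b', md5_text)
    return match.group(0).lower() if match else None


def is_current(data, md5_text):
    """Check whether the bytes in data match the digest published in md5_text."""
    return hashlib.md5(data).hexdigest() == extract_md5(md5_text)


if __name__ == '__main__':
    payload = b'apnic|AU|ipv4|1.0.0.0|256|20110811|assigned|A91872ED\n'
    digest = hashlib.md5(payload).hexdigest()
    # Both layouts below are assumptions about how a registry might publish the digest.
    print(is_current(payload, digest + '  delegated-apnic-extended-latest\n'))
    print(is_current(payload, 'MD5 (delegated-apnic-extended-latest) = ' + digest + '\n'))
```

Searching for the digest instead of slicing fixed positions keeps the check independent of trailing newlines and of where the file name appears.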

Parsing RIR Statistics Exchange Format

The following lines are extracted from delegated-apnic-extended-latest. The first line is the file header, which contains information about the file itself. The 2nd, 3rd and 4th lines are the summary lines, which contain a count of all records for each type (asn, ipv4, ipv6). The IP to country code mappings start on the 5th line. The records follow the format registry|cc|type|start|value|date|status[|extensions...]. We will use the columns cc, type, start and value to map an IP address to a country code.

2.3|apnic|20190512|120714||20190510|+1000
apnic|*|asn|*|9695|summary
apnic|*|ipv4|*|43824|summary
apnic|*|ipv6|*|67195|summary
apnic|AU|ipv4|1.0.0.0|256|20110811|assigned|A91872ED
apnic|CN|ipv4|1.0.1.0|256|20110414|allocated|A92E1062
apnic|CN|ipv4|1.0.2.0|512|20110414|allocated|A92E1062
apnic|AU|ipv4|1.0.4.0|1024|20110412|allocated|A9192210
apnic|CN|ipv4|1.0.8.0|2048|20110412|allocated|A92319D5
apnic|JP|ipv4|1.0.16.0|4096|20110412|allocated|A92D9378
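
To make the record layout concrete, here is a small sketch that splits lines like the ones above into named fields. Record and parse_record() are hypothetical names, and the rules used to skip the header and summary lines are assumptions based on this sample only.

```python
from collections import namedtuple

Record = namedtuple('Record', 'registry cc type start value date status extensions')


def parse_record(line):
    """Return a Record for a data line, or None for header, summary and comment lines."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    fields = line.split('|')
    # The header line starts with a version number; summary lines end in "summary".
    if len(fields) < 7 or fields[-1] == 'summary' or fields[0][0].isdigit():
        return None
    # Anything after the 7 fixed columns is kept as a list of extensions.
    return Record(*fields[:7], extensions=fields[7:])


if __name__ == '__main__':
    for sample in ('2.3|apnic|20190512|120714||20190510|+1000',
                   'apnic|*|ipv4|*|43824|summary',
                   'apnic|JP|ipv4|1.0.16.0|4096|20110412|allocated|A92D9378'):
        print(parse_record(sample))
```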

Reading each file is as simple as reading a CSV file that uses pipes (|) as the delimiter. For this we can use the csv module or pandas. The following code uses pandas to extract the records. Note that all 5 databases should be merged into one dataframe.

import pandas

headers = ['Registry', 'Country Code', 'Type', 'Start', 'Value', 'Date', 'Status', 'Extensions']
# The file header and the 3 summary lines occupy the first 4 rows, so slice them off.
rir_database = pandas.read_csv(rir_database_path, delimiter='|', comment='#', names=headers, dtype=str, keep_default_na=False, na_values=[''], encoding='utf-8')[4:]
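
The csv module mentioned above works as well when pandas is not available. This is a minimal sketch, assuming the files have already been downloaded. load_records() and merge_databases() are hypothetical helpers, and header and summary lines are skipped by content rather than by position.

```python
import csv
import io

HEADERS = ['Registry', 'Country Code', 'Type', 'Start', 'Value', 'Date', 'Status', 'Extensions']


def load_records(lines):
    """Yield one dict per record from an iterable of text lines."""
    data_lines = (line for line in lines if line.strip() and not line.startswith('#'))
    for row in csv.reader(data_lines, delimiter='|'):
        # Skip the file header (starts with a version number) and the summary lines.
        if len(row) < 7 or row[-1] == 'summary' or row[0][:1].isdigit():
            continue
        # zip() stops at the shorter sequence, so extra extension columns are dropped.
        yield dict(zip(HEADERS, row))


def merge_databases(paths):
    """Read every file in paths and return all records as a single list."""
    records = []
    for path in paths:
        with io.open(path, encoding='utf-8', newline='') as f:
            records.extend(load_records(f))
    return records
```

Calling merge_databases() with the 5 local file paths gives one combined list, which plays the same role as the merged dataframe.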

IPv4 to Country Code

To find the country code of an IPv4 address, we need to check whether the address falls within the block that begins at Start and contains Value addresses.

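from ipaddress import IPv4Address

address = IPv4Address('1.0.16.1')  # example address to look up
ipv4_database = rir_database[rir_database['Type'] == 'ipv4']
for index, row in ipv4_database.iterrows():
    start_address = IPv4Address(row['Start'])
    # Value is the number of addresses in the block that begins at Start
    if start_address <= address < start_address + int(row['Value']):
        print(row['Country Code'])
        break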
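
As a worked example of the range check, the sketch below looks up addresses against the sample APNIC records shown earlier, without pandas. find_ipv4_country() and the RECORDS list are illustrative names, not part of any library.

```python
from ipaddress import IPv4Address

# (country code, start, value) taken from the sample APNIC lines above
RECORDS = [
    ('AU', '1.0.0.0', 256),
    ('CN', '1.0.1.0', 256),
    ('CN', '1.0.2.0', 512),
    ('AU', '1.0.4.0', 1024),
    ('CN', '1.0.8.0', 2048),
    ('JP', '1.0.16.0', 4096),
]


def find_ipv4_country(address, records=RECORDS):
    """Return the country code whose block contains address, or None."""
    address = IPv4Address(address)
    for country_code, start, value in records:
        start_address = IPv4Address(start)
        # Value is an address count, so the block covers
        # start_address up to start_address + value - 1.
        if start_address <= address < start_address + value:
            return country_code
    return None


if __name__ == '__main__':
    print(find_ipv4_country('1.0.3.255'))  # last address of the 512-address block at 1.0.2.0
    print(find_ipv4_country('1.0.16.1'))
    print(find_ipv4_country('8.8.8.8'))    # not covered by the sample records
```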

IPv6 to Country Code

IPv6 is calculated differently. The Start and Value columns together describe an IPv6 network: Start is the network address and Value is the prefix length.

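from ipaddress import IPv6Address, IPv6Network

address = IPv6Address('2001:db8::1')  # example only; replace with the address to look up
ipv6_database = rir_database[rir_database['Type'] == 'ipv6']
for index, row in ipv6_database.iterrows():
    # Start and Value combine into CIDR notation, e.g. <start>/<prefix length>
    if address in IPv6Network(row['Start'] + '/' + row['Value']):
        print(row['Country Code'])
        break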
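
The same idea can be tried without any downloaded data. The sketch below uses the documentation prefix 2001:db8::/32 with a placeholder country code. The record is hypothetical and not taken from any RIR file, and find_ipv6_country() is an illustrative name.

```python
from ipaddress import IPv6Address, IPv6Network

# Hypothetical record: 2001:db8::/32 is reserved for documentation, 'ZZ' is a placeholder.
RECORDS = [('ZZ', '2001:db8::', '32')]


def find_ipv6_country(address, records=RECORDS):
    """Return the country code whose network contains address, or None."""
    address = IPv6Address(address)
    for country_code, start, value in records:
        # Start and Value combine into CIDR notation, e.g. 2001:db8::/32.
        if address in IPv6Network(start + '/' + value):
            return country_code
    return None


if __name__ == '__main__':
    print(find_ipv6_country('2001:db8::1'))
    print(find_ipv6_country('2001:db9::1'))
```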

Bonus: IPToCC

I have been maintaining the IPToCC library for a while now. The implementations discussed here are all part of the library. To install IPToCC, run the following pip command.

pip install IPToCC

To use IPToCC in a Python script, import and call get_country_code().

from iptocc import get_country_code
country_code = get_country_code('<IPv4/IPv6 address>')

Conclusion

This article only demonstrates how IP address delegations are downloaded from RIR servers and parsed. Most IP to country code lookup services and APIs use different storage and lookup algorithms for speed. RIR databases only contain country codes, so the name of the country has to be obtained from a separate mapping, such as the iso3166.csv file that can be downloaded from MaxMind.
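
One common way to speed up lookups is to sort the IPv4 blocks by their starting address and binary-search them. The sketch below shows the idea and does not describe what any particular service does. build_index() and lookup() are hypothetical names, and the blocks are assumed not to overlap.

```python
import bisect
from ipaddress import IPv4Address


def build_index(records):
    """Turn (country code, start, value) tuples into sorted parallel lists."""
    blocks = sorted((int(IPv4Address(start)), int(value), cc) for cc, start, value in records)
    starts = [block[0] for block in blocks]
    return starts, blocks


def lookup(address, index):
    """Return the country code for address, or None if no block contains it."""
    starts, blocks = index
    target = int(IPv4Address(address))
    # Find the last block whose start is <= target.
    position = bisect.bisect_right(starts, target) - 1
    if position < 0:
        return None
    start, size, cc = blocks[position]
    return cc if target < start + size else None


if __name__ == '__main__':
    index = build_index([('JP', '1.0.16.0', '4096'), ('AU', '1.0.0.0', '256')])
    print(lookup('1.0.20.1', index))
```

Each lookup then takes a logarithmic number of comparisons instead of a scan over every row.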


