{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AWK tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See https://github.com/tdhopper/awk-lessons for inspiration\n", "\n", "## Lesson 01: Basics of Awk\n", "\n", "If you haven't read the Awk man page, you should start there. It's helpful! Some highlights:\n", "\n", "```\n", "awk − pattern-directed scanning and processing language\n", "\n", "awk [ −F fs ] [ −v var=value ] [ ’prog’ | −f progfile ] [ file ... ]\n", "```\n", "\n", "Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as −f progfile.\n", "\n", "With each pattern there can be an associated action that will be performed when a line of a file matches the pattern.\n", "\n", "Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern\n", "\n", "A pattern-action statement has the form pattern {action}.\n", "\n", "A missing { action } means print the line; a missing pattern always matches.\n", "\n", "I created an simple example file to demonstrate basic Awk:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n", "bb\n", "ccc\n", "dddd\n", "ggg\n", "hh\n", "i" ] } ], "source": [ "cat data/letters.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Basic Pattern\n", "\n", "If we match lines longer than two characters and use the implicit print action, we get:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bb\n", "ccc\n", "dddd\n", "ggg\n", "hh\n" ] } ], "source": [ "awk 'length $0 > 2' data/letters.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$0 is a built-in variable that contains the line.\n", "\n", "### A Basic Function\n", "\n", "If we leave out a pattern, we will match every line. 
A trivial action would be to print each line:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n", "bb\n", "ccc\n", "dddd\n", "ggg\n", "hh\n", "i\n", "\n", "Using the length function as our action, we can get the length of each line:\n", "\n", "1\n", "2\n", "3\n", "4\n", "3\n", "2\n", "1\n" ] } ], "source": [ "awk '{ print }' data/letters.txt\n", "\n", "echo\n", "echo Using the length function as our action, we can get the length of each line:\n", "echo \n", "\n", "awk '{ print length }' data/letters.txt" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1a\n", "2bb\n", "3ccc\n", "4dddd\n", "3ggg\n", "2hh\n", "1i\n", "\n", "The above prints length of line and line - the value of $0\n", "\n", "1 a\n", "2 bb\n", "3 ccc\n", "4 dddd\n", "3 ggg\n", "2 hh\n", "1 i\n", "\n", "Using , as separator puts whitespace\n", "\n" ] } ], "source": [ "# We can combine things\n", "awk '{ print length $0}' data/letters.txt\n", "\n", "echo \n", "echo The above prints length of line and line - the value of \\$0\n", "echo\n", "\n", "awk '{ print length,$0}' data/letters.txt\n", "echo \n", "echo Using \",\" as separator puts whitespace \n", "echo\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HI\n", "a\n", "bb\n", "ccc\n", "dddd\n", "ggg\n", "hh\n", "i\n", "BYE!\n" ] } ], "source": [ "# Awk has special controls for executing some code before the file input begins and after it is complete.\n", "\n", "awk 'BEGIN { print \"HI\" } { print $0 } END { print \"BYE!\" }' data/letters.txt" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Don't Panic! \n" ] } ], "source": [ "awk \"BEGIN { print \\\"Don't Panic! \\\" }\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Combining Patterns and Functions\n", "Of course, patterns and functions can be combined so that the function is only applied when the pattern is matched.\n", "\n", "From the man page:\n", "```\n", "A pattern-action statement has the form\n", "\n", "pattern { action }\n", "```\n", "\n", "We can print the length of all lines longer than 2 characters." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n", "4\n", "3\n" ] } ], "source": [ "awk 'length($0) > 2 { print length($0) }' data/letters.txt" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Short: 1\n", "Long: 3\n", "Long: 4\n", "Long: 3\n", "Short: 1\n" ] } ], "source": [ "# Actually, we don't have to limit Awk to just one pattern! \n", "# We can have arbitrarily many patterns separated by a semicolon or a new line:\n", "\n", "awk 'length($0) > 2 { print \"Long: \" length($0) }; length($0) < 2 { print \"Short: \" length($0) }' data/letters.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple Fields\n", "\n", "Awk is designed for easy handling of data with multiple fields per row. 
\n", "The field delimiter can be specified with the -F option.\n", "\n", "Here's a simple space-delimited file:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Roses are red,\n", "Violets are blue,\n", "Sugar is sweet,\n", "And so are you.\n" ] } ], "source": [ "cat data/field_data.txt" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "are\n", "are\n", "is\n", "so\n", "\n", "are\n", "are\n", "is\n", "so\n" ] } ], "source": [ "# If we specify the field seperator, we can print the second field from each row:\n", "\n", "awk -F \" \" '{ print $2 }' data/field_data.txt\n", "echo\n", "\n", "# which is also a default\n", "awk '{ print $2 }' data/field_data.txt\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "you.\n" ] } ], "source": [ "# We don't get an error if a line doesn't have the referenced field; it just shows up as blank\n", "awk -F \" \" '{ print $4 }' data/field_data.txt" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Field 1: Roses \n", "Field 2: red,\n", "Field 1: Violets \n", "Field 2: blue,\n", "Field 1: Sugar \n", "Field 2: sweet,\n", "Field 1: And \n", "Field 2: you.\n" ] } ], "source": [ "# The seperator expression is interpreted as a regular expression.\n", "\n", "awk -F \"((so )?are|is) \" '{print \"Field 1: \" $1 \"\\nField 2: \" $2}' data/field_data.txt\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expressions\n", "\n", "Patterns can be regular expressions, not just built-in functions. From the man page:\n", "\n", "Regular expressions are as defined in re_format(7) - \n", "Isolated regular expressions in a pattern apply to the entire line." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 1172\n", "drwxr-xr-x 4 jovyan users 128 Oct 20 17:01 .\n", "drwxr-xr-x 17 jovyan users 544 Oct 20 18:21 ..\n", "-rw-r--r-- 1 jovyan users 256374 Oct 20 16:59 8927565-d9783627c731268fb2935a731a618aa8e95cf465.zip\n", "-rw-r--r-- 1 jovyan users 938847 Feb 11 2014 words\n" ] } ], "source": [ "ls -la dict" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "gooier\n", "gooiest\n", "queue\n", "queue's\n", "queued\n", "queues\n", "queuing\n" ] } ], "source": [ "awk '/^[a-z][aeiou]{4}/' dict/words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using simple filters\n", "From GNU manual - https://www.gnu.org/software/gawk/manual/html_node/Very-Simple.html#Very-Simple" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Amelia 555-5553 amelia.zodiacusque@gmail.com F\n", "Broderick 555-0542 broderick.aliquotiens@yahoo.com R\n", "Julie 555-6699 julie.perscrutabor@skeeve.com F\n", "Samuel 555-3430 samuel.lanceolis@shu.edu A\n" ] } ], "source": [ "# print all lines contaning 'li'\n", "awk '/li/ { print $0 }' data/mail_data.txt" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "23\n" ] } ], "source": [ "# Print length of longest word in dictionary\n", "\n", "awk '{ if (length($0) > max) max = length($0) }\n", " END { print max }' dict/words" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The raw data\n", "total 16\n", "-rw-r--r-- 1 jovyan users 65 Oct 20 16:45 field_data.txt\n", "-rw-r--r-- 1 jovyan users 320 Oct 20 22:04 inventory_shipped.txt\n", "-rw-r--r-- 1 jovyan users 22 Oct 20 16:25 letters.txt\n", "-rw-r--r-- 1 jovyan users 659 Oct 20 21:59 mail_data.txt\n", "===========\n", "total bytes: 1066\n", "total K-bytes: 1.04102\n" ] } ], "source": [ "# print total number of bytes used by files\n", "\n", "\n", "echo The raw data\n", "ls -l data \n", "echo ===========\n", "ls -l data | awk '{ x += $5 }\n", " END { print \"total bytes: \" x }'\n", " \n", "ls -l data | awk '{ x += $5 }\n", " END { print \"total K-bytes:\", x / 1024 }' " ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_apt\n", "backup\n", "bin\n", "daemon\n", "games\n", "gnats\n", "irc\n", "jovyan\n", "list\n", "lp\n", "mail\n", "man\n", "news\n", "nobody\n", "proxy\n", "root\n", "sync\n", "sys\n", "uucp\n", "www-data\n" ] } ], "source": [ "# Print a sorted list of the login names of all users:\n", "awk -F: '{ print $1 }' /etc/passwd | sort" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20\n" ] } ], "source": [ "# Count the number of lines\n", "awk 'END { print NR }' /etc/passwd" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anthony 555-3412 anthony.asserturo@hotmail.com A\n", "Bill 555-1675 bill.drowning@hotmail.com A\n", "Camilla 555-2912 camilla.infusarum@skynet.be R\n", "Julie 555-6699 julie.perscrutabor@skeeve.com F\n", "Samuel 555-3430 
samuel.lanceolis@shu.edu A\n", "\n", "Full file\n", "Amelia 555-5553 amelia.zodiacusque@gmail.com F\n", "Anthony 555-3412 anthony.asserturo@hotmail.com A\n", "Becky 555-7685 becky.algebrarum@gmail.com A\n", "Bill 555-1675 bill.drowning@hotmail.com A\n", "Broderick 555-0542 broderick.aliquotiens@yahoo.com R\n", "Camilla 555-2912 camilla.infusarum@skynet.be R\n", "Fabius 555-1234 fabius.undevicesimus@ucb.edu F\n", "Julie 555-6699 julie.perscrutabor@skeeve.com F\n", "Martin 555-6480 martin.codicibus@hotmail.com A\n", "Samuel 555-3430 samuel.lanceolis@shu.edu A\n", "Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R" ] } ], "source": [ "# print even numbered lines from file\n", "\n", "awk 'NR % 2 == 0' data/mail_data.txt\n", "echo\n", "echo 'Full file'\n", "cat data/mail_data.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using more than one rule / file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Anthony 555-3412 anthony.asserturo@hotmail.com A\n", "Camilla 555-2912 camilla.infusarum@skynet.be R\n", "Fabius 555-1234 fabius.undevicesimus@ucb.edu F\n", "Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R\n", "Jean-Paul 555-2127 jeanpaul.campanorum@nyu.edu R\n", "Jan 21 36 64 620\n", "Apr 21 70 74 514\n" ] } ], "source": [ "# The awk utility reads the input files one line at a time. \n", "# For each line, awk tries the patterns of each rule. \n", "# If several patterns match, then several actions execute in the order in which they appear in the awk program. \n", "# If no patterns match, then no actions run.\n", "\n", "\n", "awk '/12/ { print $0 }\n", "/21/ { print $0 }' data/mail_data.txt data/inventory_shipped.txt\n", "\n", "# Note how the line beginning with ‘Jean-Paul’ in mail-list was printed twice, once for each rule." 
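] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When several input files are given, the built-in variables FILENAME and FNR (covered later under built-in variables) tell you which file the current record came from and its line number within that file. A minimal sketch against the same two data files:\n", "\n", "```\n", "awk '/21/ { print FILENAME \":\" FNR \": \" $0 }' data/mail_data.txt data/inventory_shipped.txt\n", "```"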
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "566520\n" ] } ], "source": [ "# Sum of size of all files modified in November\n", "ls -l /usr/bin | awk '$6 == \"Nov\" { sum += $5 }\n", " END { print sum }'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environment variables\n", "See https://www.gnu.org/software/gawk/manual/html_node/Environment-Variables.html#Environment-Variables" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AWKPATH=\n", "AWKLIBPATH=\n" ] } ], "source": [ "echo AWKPATH=$AWKPATH\n", "echo AWKLIBPATH=$AWKLIBPATH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using include files\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BEGIN {\n", " print \"This is script test1.\"\n", "}" ] } ], "source": [ "cat test1.awk\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is script test1.\n", "This is script test2.\n" ] } ], "source": [ "awk '@include \"test1.awk\" \n", "BEGIN {\n", " print \"This is script test2.\"\n", "}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Expressions" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9, 11, 17\n", "---\n", "021 is 17\n", "018 is 1\n" ] } ], "source": [ "# constants - numerical - octal, decimal, hexa\n", "awk 'BEGIN { printf \"%d, %d, %d\\n\", 011, 11, 0x11 }'\n", "echo ---\n", "\n", "# number 8 is not valid in octal - will stop conversion\n", "awk 'BEGIN { print \"021 is\", 021 ; print \"018 is\", 018 }'\n", "\n" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "found camels\n", "found footed\n", "----\n", "found camels\n", "found footed\n" ] } ], "source": [ "# RegExp constants\n", "awk '{ if ($0 ~ /^foote/ || $0 ~ /camels/)\n", " print \"found\", $0 }' dict/words\n", " \n", "# Is same as \n", "echo ----\n", "awk '{ if (/^foote/ || /camels/)\n", " print \"found\", $0 }' dict/words\n", " \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Passing variables into program\n", "\n", "The -v option for Awk allows us to pass variables it the program. \n", "For example, we could use it to hard code constants." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.1415\n", "/home/jovyan/work\n" ] } ], "source": [ "awk -v pi=3.1415 'BEGIN { print pi }'\n", "\n", "# The $USER will work in terminal, not in Jupyter or in Docker\n", "awk -v curdir=$PWD 'BEGIN { print curdir }'" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13\n", "15\n", "15\n", "31\n", "16\n", "31\n", "24\n", "15\n", "13\n", "29\n", "20\n", "17\n", "\n", "21\n", "26\n", "24\n", "21\n", "555-5553\n", "555-3412\n", "555-7685\n", "555-1675\n", "555-0542\n", "555-2912\n", "555-1234\n", "555-6699\n", "555-6480\n", "555-3430\n", "555-2127\n", "----\n", "Jan\n", "Feb\n", "Mar\n", "Apr\n", "May\n", "Jun\n", "Jul\n", "Aug\n", "Sep\n", "Oct\n", "Nov\n", "Dec\n", "\n", "Jan\n", "Feb\n", "Mar\n", "Apr\n", "amelia.zodiacusque@gmail.com\n", "anthony.asserturo@hotmail.com\n", "becky.algebrarum@gmail.com\n", "bill.drowning@hotmail.com\n", "broderick.aliquotiens@yahoo.com\n", "camilla.infusarum@skynet.be\n", "fabius.undevicesimus@ucb.edu\n", "julie.perscrutabor@skeeve.com\n", "martin.codicibus@hotmail.com\n", "samuel.lanceolis@shu.edu\n", "jeanpaul.campanorum@nyu.edu\n" ] } ], "source": [ "# When is variable set: with -v, at the very beginning\n", "awk -v n=2 '{ print $n }' data/inventory_shipped.txt data/mail_data.txt\n", "\n", "echo ----\n", "# But here - in order\n", "awk '{ print $n }' n=1 data/inventory_shipped.txt n=3 data/mail_data.txt\n" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27\n" ] } ], "source": [ "# How awk Converts Between Strings and Numbers\n", "\n", "awk 'BEGIN {two = 2; three = 3\n", "print (two three) + 4}'\n", "\n", "# The numeric values of the variables two and three are converted to strings and concatenated together. \n", "# The resulting string is converted back to the number 23, to which 4 is then added." ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3.14159\n", "3.14159\n" ] } ], "source": [ "# Locale enforced\n", "\n", "awk 'BEGIN { printf \"%g\\n\", 3.1415927 }'\n", "LC_ALL=en_DK.utf-8 awk 'BEGIN { printf \"%g\\n\", 3.1415927 }'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operators\n", "\n", "The following list provides the arithmetic operators in awk, in order from the highest precedence to the lowest:\n", "```\n", "x ^ y\n", "x ** y\n", "Exponentiation; x raised to the y power. ‘2 ^ 3’ has the value eight; the character sequence ‘**’ is equivalent to ‘^’. (c.e.)\n", "\n", "- x\n", "Negation.\n", "\n", "+ x\n", "Unary plus; the expression is converted to a number.\n", "\n", "x * y\n", "Multiplication.\n", "\n", "x / y\n", "Division; because all numbers in awk are floating-point numbers, the result is not rounded to an integer—‘3 / 4’ has the value 0.75. 
(It is a common mistake, especially for C programmers, to forget that all numbers in awk are floating point, and that division of integer-looking constants produces a real number, not an integer.)\n", "\n", "x % y\n", "Remainder; further discussion is provided in the text, just after this list.\n", "\n", "x + y\n", "Addition.\n", "\n", "x - y\n", "Subtraction.\n", "```" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Field number one: Becky\n", "Field number one: Bill\n", "Field number one: Broderick\n", "----\n", "Field number one:Becky\n", "Field number one:Bill\n", "Field number one:Broderick\n" ] } ], "source": [ "# Concatenation\n", "\n", "# Concatenation is performed by writing expressions next to one another, with no operator.\n", "awk '/^B/ { print \"Field number one: \" $1 }' data/mail_data.txt\n", "# Without the space in the string constant after the ‘:’, the line runs together. \n", "echo ----\n", "awk '/^B/ { print \"Field number one:\" $1 }' data/mail_data.txt\n" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-12-24\n", "-12 -24\n" ] } ], "source": [ "# The precedence of concatenation, when mixed with other operators, is often counter-intuitive. Consider this example:\n", "awk 'BEGIN { print -12 \" \" -24 }'\n", "# But where did the space disappear to?\n", "awk 'BEGIN { print -12 \" \" (-24) }'\n", "# This forces awk to treat the ‘-’ on the ‘-24’ as unary. Otherwise, it’s parsed as follows:\n", "## -12 (\" \" - 24)\n", "## ⇒ -12 (0 - 24)\n", "## ⇒ -12 (-24)\n", "## ⇒ -12-24" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assignment\n", "\n", "Assignments can be chained, because an assignment is itself an expression whose value is the value that was assigned. So this is OK and stores 5 in all three variables:\n", "\n", "x = y = z = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Truth values\n", "\n", "Any nonzero numeric value or any nonempty string value is true." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A strange truth value\n", "A strange truth value\n", "A strange truth value\n", "String zero\n" ] } ], "source": [ "awk 'BEGIN {\n", " if (3.1415927)\n", " print \"A strange truth value\"\n", " if (\"Four Score And Seven Years Ago\")\n", " print \"A strange truth value\"\n", " if (j = 57)\n", " print \"A strange truth value\"\n", " \n", "}'\n", "\n", "# There is a surprising consequence of the “nonzero or non-null” rule: \n", "# the string constant \"0\" is actually true, because it is non-null. \n", "\n", "awk 'BEGIN {\n", " if (0)\n", " print \"Numerical zero\"\n", " if (\"0\")\n", " print \"String zero\"\n", " if (0 / 123)\n", " print \"expression\"\n", " \n", "}'" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a is untyped\n", "awk: cmd. line:1: fatal: function `typeof' not defined\n" ] }, { "ename": "", "evalue": "2", "output_type": "error", "traceback": [] } ], "source": [ "# Speaking of Types\n", "awk 'BEGIN { print (a == \"\" && a == 0 ? 
\"a is untyped\" : \"a has a type\"); print typeof(a); }'\n", "#awk 'BEGIN { a = 42 ; print typeof(a); b = a ; print typeof(b); }'\n", "\n", "# Typeof => from 4.2 " ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello is not < 42\n", "37 is < 42\n" ] } ], "source": [ "# Since ‘hello’ is alphabetic data, awk can only do a string comparison. \n", "# Internally, it converts 42 into \"42\" and compares the two string values \"hello\" and \"42\". Here’s the result:\n", "\n", "echo hello | awk '{ printf(\"%s %s < 42\\n\", $1,\n", " ($1 < 42 ? \"is\" : \"is not\")) }'\n", "\n", "# However, what happens when data from a user looks like a number? On the one hand, in reality, \n", "# the input data consists of characters, not binary numeric values. \n", "# But, on the other hand, the data looks numeric, and awk really ought to treat it as such. And indeed, it does:\n", "echo 37 | awk '{ printf(\"%s %s < 42\\n\", $1,\n", " ($1 < 42 ? \"is\" : \"is not\")) }' " ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "0\n", "0\n", "1\n", "0\n", "1\n", "0\n", "1\n" ] } ], "source": [ "echo ' +3.14' | awk '{ print($0 == \" +3.14\") }' # True\n", "echo ' +3.14' | awk '{ print($0 == \"+3.14\") }' # False\n", "echo ' +3.14' | awk '{ print($0 == \"3.14\") }' # False\n", "echo ' +3.14' | awk '{ print($0 == 3.14) }' # True\n", "\n", "echo ' +3.14' | awk '{ print($1 == \" +3.14\") }' # False\n", "echo ' +3.14' | awk '{ print($1 == \"+3.14\") }' # True\n", "echo ' +3.14' | awk '{ print($1 == \"3.14\") }' # False\n", "echo ' +3.14' | awk '{ print($1 == 3.14) }' # True\n" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello scalar\n" ] } ], "source": [ "# echo hello 37 | awk '{ for(k in PROCINFO[\"identifiers\"]) print(k, PROCINFO[\"identifiers\"][k]) }'\n", "echo hello 37 | awk -v a=\"hello\" -v b=37 '{ print(a, PROCINFO[\"identifiers\"][\"a\"]) }'" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.237788 5\n", "0.845814 5\n", "0.291066 \n" ] } ], "source": [ "# There are situations where using ‘+=’ (or any assignment operator) \n", "# is not the same as simply repeating the lefthand operand in the righthand expression. 
For example:\n", "\n", "# Thanks to Pat Rankin for this example\n", "awk 'BEGIN {\n", " foo[rand()] += 5\n", " for (x in foo)\n", " print x, foo[x]\n", "\n", " bar[rand()] = bar[rand()] + 5\n", " for (x in bar)\n", " print x, bar[x]\n", "}'" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jan 13 25 15 115\n", "Feb 15 32 24 226\n", "Mar 15 24 34 228\n", "Apr 31 52 63 420\n", "May 16 34 29 208\n", "Jun 31 42 75 492\n", "Jul 24 34 67 436\n", "Aug 15 34 47 316\n", "Sep 13 55 37 277\n", "Oct 29 54 68 525\n", "Nov 20 87 82 577\n", "Dec 17 35 61 401\n", "\n", "Jan 21 36 64 620\n", "Feb 26 58 80 652\n", "Mar 24 75 70 495\n", "Apr 21 70 74 514Jan 17.6667\n", "Feb 23.6667\n", "Mar 24.3333\n", "Apr 48.6667\n", "May 26.3333\n", "Jun 49.3333\n", "Jul 41.6667\n", "Aug 32\n", "Sep 35\n", "Oct 50.3333\n", "Nov 63\n", "Dec 37.6667\n", " 0\n", "Jan 40.3333\n", "Feb 54.6667\n", "Mar 56.3333\n", "Apr 55\n" ] } ], "source": [ "cat data/inventory_shipped.txt\n", "awk '{ sum = $2 + $3 + $4 ; avg = sum / 3; print $1, avg }' data/inventory_shipped.txt" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HOME \n" ] } ], "source": [ "awk 'BEGIN { if (! (\"HOME\" in ENVIRON)) print \"no home!\"; else print(\"HOME\", ENVIRON['HOME']);}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Patterns\n", "\n", "Patterns in awk control the execution of rules—a rule is executed when its pattern matches the current input record. The following is a summary of the types of awk patterns:\n", "```\n", "/regular expression/\n", "A regular expression. It matches when the text of the input record fits the regular expression. (See Regexp.)\n", "\n", "expression\n", "A single expression. It matches when its value is nonzero (if a number) or non-null (if a string). (See Expression Patterns.)\n", "\n", "begpat, endpat\n", "A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches begpat and the final record that matches endpat. (See Ranges.)\n", "\n", "BEGIN\n", "END\n", "Special patterns for you to supply startup or cleanup actions for your awk program. (See BEGIN/END.)\n", "\n", "BEGINFILE\n", "ENDFILE\n", "Special patterns for you to supply startup or cleanup actions to be done on a per-file basis. (See BEGINFILE/ENDFILE.)\n", "\n", "empty\n", "The empty pattern matches every input record. (See Empty.)\n", "```\n" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "# Using shell variables\n", "# Note the shell friendly quoting\n", "PATTERN=zoom\n", "awk \"/$PATTERN/\"'{nmatches++;} END {print nmatches}' dict/words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Actions\n", "An awk program or script consists of a series of rules and function definitions interspersed. (Functions are described later. See User-defined.) A rule contains a pattern and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:\n", "```\n", "[pattern] { action }\n", " pattern [{ action }]\n", "…\n", "function name(args) { … }\n", "…\n", "```\n", "\n", "An action consists of one or more awk statements, enclosed in braces (‘{…}’). 
Each statement specifies one thing to do. The statements are separated by newlines or semicolons. The braces around an action must be used even if the action contains only one statement, or if it contains no statements at all. However, if you omit the action entirely, omit the braces as well. An omitted action is equivalent to ‘{ print $0 }’:\n", "\n", "```\n", "/foo/ { } match foo, do nothing — empty action\n", "/foo/ match foo, print the record — omitted action\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### If-else Statements\n", "If-else statements in Awk are of the form:\n", "\n", "if (condition) then-body [else else-body]\n", "\n", "For example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 is odd\n", "2 is even\n", "3 is odd\n", "4 is even\n" ] } ], "source": [ "printf \"1\\n2\\n3\\n4\" | awk \\\n", " '{ \\\n", " if ($1 % 2 == 0) print $1, \"is even\"; \\\n", " else print $1, \"is odd\" \\\n", " }'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Looping\n", "Awk includes several looping statements: while, do while, and for.\n", "\n", "They take the expected C-ish syntax." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n" ] } ], "source": [ "awk \\\n", " 'BEGIN { \\\n", " i = 0; \\\n", " while (i < 5) { print i; i+=1; } \\\n", " }'" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Jan\n", "13\n", "25\n", "Feb\n", "15\n", "32\n", "Mar\n", "15\n", "24\n", "Apr\n", "31\n", "52\n", "May\n", "16\n", "34\n", "Jun\n", "31\n", "42\n", "Jul\n", "24\n", "34\n", "Aug\n", "15\n", "34\n", "Sep\n", "13\n", "55\n", "Oct\n", "29\n", "54\n", "Nov\n", "20\n", "87\n", "Dec\n", "17\n", "35\n", "\n", "\n", "\n", "Jan\n", "21\n", "36\n", "Feb\n", "26\n", "58\n", "Mar\n", "24\n", "75\n", "Apr\n", "21\n", "70\n" ] } ], "source": [ "awk '\n", "{\n", " i = 1\n", " while (i <= 3) {\n", " print $i\n", " i++\n", " }\n", "}' data/inventory_shipped.txt" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n" ] } ], "source": [ "awk \\\n", " 'BEGIN { \\\n", " i = 0; \\\n", " do { print i; i+=1; } while(i < 5) \\\n", " }'" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "2\n", "3\n", "4\n" ] } ], "source": [ "awk \\\n", " 'BEGIN { \\\n", " i = 0; \\\n", " for(i = 0; i<5; i++) print i \\\n", " }'" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)\n", "Copyright (C) 1989, 1991-2016 Free Software Foundation.\n", "\n", "This program is free software; you can redistribute it and/or modify\n", "it under the terms of the GNU General Public License as published by\n", "the Free Software Foundation; either version 3 of the License, or\n", "(at your option) any later version.\n", "\n", "This program is distributed in the hope that it will be useful,\n", "but WITHOUT ANY WARRANTY; without even the implied warranty of\n", 
"MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n", "GNU General Public License for more details.\n", "\n", "You should have received a copy of the GNU General Public License\n", "along with this program. If not, see http://www.gnu.org/licenses/.\n" ] } ], "source": [ "awk --version" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Roses are red,\n", "Violets are blue,\n", "Sugar is sweet,\n", "And so are you.\n", "----\n", "data/field_data.txt:1: skipped: NF != 4\n", "data/field_data.txt:2: skipped: NF != 4\n", "data/field_data.txt:3: skipped: NF != 4\n" ] } ], "source": [ "# next - skip the line\n", "cat data/field_data.txt\n", "echo ----\n", "awk 'NF != 4 {\n", " printf(\"%s:%d: skipped: NF != 4\\n\", FILENAME, FNR) \n", " next\n", "}' data/field_data.txt\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Variables\n", "\n", "## Built-in variables\n", "\n", "https://www.gnu.org/software/gawk/manual/html_node/User_002dmodified.html#User_002dmodified\n", "\n", "### FIELDWIDTHS \n", "\n", "A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.\n", "\n", "### FPAT \n", "\n", "A regular expression (as a string) that tells gawk to create the fields based on text that matches the regular expression. Assigning a value to FPAT overrides the use of FS and FIELDWIDTHS for field splitting. See Splitting By Content for more information.\n", "\n", "### FS\n", "\n", "The input field separator (see Field Separators). The value is a single-character string or a multicharacter regular expression that matches the separations between fields in an input record. If the value is the null string (\"\"), then each character in the record becomes a separate field. (This behavior is a gawk extension. POSIX awk does not specify the behavior when FS is the null string. Nonetheless, some other versions of awk also treat \"\" specially.)\n", "\n", "The default value is \" \", a string consisting of a single space. As a special exception, this value means that any sequence of spaces, TABs, and/or newlines is a single separator. It also causes spaces, TABs, and newlines at the beginning and end of a record to be ignored.\n", "\n", "You can set the value of FS on the command line using the -F option:\n", "\n", "```\n", "awk -F, 'program' input-files\n", "```\n", "\n", "If gawk is using FIELDWIDTHS or FPAT for field splitting, assigning a value to FS causes gawk to return to the normal, FS-based field splitting. An easy way to do this is to simply say ‘FS = FS’, perhaps with an explanatory comment.\n", "\n", "### IGNORECASE \n", "\n", "If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case-independent. This applies to regexp matching with ‘~’ and ‘!~’, the gensub(), gsub(), index(), match(), patsplit(), split(), and sub() functions, record termination with RS, and field splitting with FS and FPAT. However, the value of IGNORECASE does not affect array subscripting and it does not affect field splitting when using a single-character field separator. 
See Case-sensitivity.\n", "\n", "### OFMT\n", "A string that controls conversion of numbers to strings (see Conversion) for printing with the print statement. It works by being passed as the first argument to the sprintf() function (see String Functions). Its default value is \"%.6g\". Earlier versions of awk used OFMT to specify the format for converting numbers to strings in general expressions; this is now done by CONVFMT.\n", "\n", "### OFS\n", "The output field separator (see Output Separators). It is output between the fields printed by a print statement. Its default value is \" \", a string consisting of a single space.\n", "\n", "### ORS\n", "The output record separator. It is output at the end of every print statement. Its default value is \"\\n\", the newline character. (See Output Separators.)\n", "\n", "### RS\n", "The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text. (See Records.)\n", "\n", "The ability for RS to be a regular expression is a gawk extension. In most other awk implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Built-in variables set by AWK\n", "\n", "### ARGC, ARGV\n", "\n", "The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. See Other Arguments. Unlike most awk arrays, ARGV is indexed from 0 to ARGC - 1. In the following example:\n", "\n" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "awk\n", "data/field_data.txt\n", "data/inventory_shipped.txt\n" ] } ], "source": [ "awk 'BEGIN {\n", " for (i = 0; i < ARGC; i++)\n", " print ARGV[i] }' data/field_data.txt data/inventory_shipped.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ARGIND\n", "The index in ARGV of the current file being processed. Every time gawk opens a new data file for processing, it sets ARGIND to the index in ARGV of the file name. When gawk is processing the input files, ‘FILENAME == ARGV[ARGIND]’ is always true.\n", "\n", "This variable is useful in file processing; it allows you to tell how far along you are in the list of data files as well as to distinguish between successive instances of the same file name on the command line.\n", "\n", "While you can change the value of ARGIND within your awk program, gawk automatically sets it to a new value when it opens the next file.\n", "\n", "### ENVIRON\n", "An associative array containing the values of the environment. The array indices are the environment variable names; the elements are the values of the particular environment variables. For example, ENVIRON[\"HOME\"] might be /home/arnold.\n", "\n", "For POSIX awk, changing this array does not affect the environment passed on to any programs that awk may spawn via redirection or the system() function.\n", "\n", "However, beginning with version 4.2, if not in POSIX compatibility mode, gawk does update its own environment when ENVIRON is changed, thus changing the environment seen by programs that it creates. 
You should therefore be especially careful if you modify ENVIRON[\"PATH\"], which is the search path for finding executable programs.\n", "\n", "This can also affect the running gawk program, since some of the built-in functions may pay attention to certain environment variables. The most notable instance of this is mktime() (see Time Functions), which pays attention the value of the TZ environment variable on many systems.\n", "\n", "Some operating systems may not have environment variables. On such systems, the ENVIRON array is empty (except for ENVIRON[\"AWKPATH\"] and ENVIRON[\"AWKLIBPATH\"]; see AWKPATH Variable and see AWKLIBPATH Variable).\n", "\n", "### ERRNO\n", "If a system error occurs during a redirection for getline, during a read for getline, or during a close() operation, then ERRNO contains a string describing the error.\n", "\n", "In addition, gawk clears ERRNO before opening each command-line input file. This enables checking if the file is readable inside a BEGINFILE pattern (see BEGINFILE/ENDFILE).\n", "\n", "Otherwise, ERRNO works similarly to the C variable errno. Except for the case just mentioned, gawk never clears it (sets it to zero or \"\"). Thus, you should only expect its value to be meaningful when an I/O operation returns a failure value, such as getline returning -1. You are, of course, free to clear it yourself before doing an I/O operation.\n", "\n", "If the value of ERRNO corresponds to a system error in the C errno variable, then PROCINFO[\"errno\"] will be set to the value of errno. For non-system errors, PROCINFO[\"errno\"] will be zero.\n", "\n", "### FILENAME\n", "The name of the current input file. When no data files are listed on the command line, awk reads from the standard input and FILENAME is set to \"-\". FILENAME changes each time a new file is read (see Reading Files). Inside a BEGIN rule, the value of FILENAME is \"\", because there are no input files being processed yet.39 (d.c.) Note, though, that using getline (see Getline) inside a BEGIN rule can give FILENAME a value.\n", "\n", "### FNR\n", "The current record number in the current file. awk increments FNR each time it reads a new record (see Records). awk resets FNR to zero each time it starts a new input file.\n", "\n", "### NF\n", "\n", "The number of fields in the current input record. NF is set each time a new record is read, when a new field is created, or when $0 changes (see Fields).\n", "\n", "Unlike most of the variables described in this subsection, assigning a value to NF has the potential to affect awk’s internal workings. In particular, assignments to NF can be used to create fields in or remove fields from the current record. See Changing Fields.\n", "\n", "### FUNCTAB\n", "An array whose indices and corresponding values are the names of all the built-in, user-defined, and extension functions in the program.\n", "\n", "NOTE: Attempting to use the delete statement with the FUNCTAB array causes a fatal error. Any attempt to assign to an element of FUNCTAB also causes a fatal error.\n", "\n", "### NR\n", "The number of input records awk has processed since the beginning of the program’s execution (see Records). awk increments NR each time it reads a new record.\n", "\n", "### PROCINFO\n", "The elements of this array provide access to information about the running awk program. 
The following elements (listed alphabetically) are guaranteed to be available:\n", "\n", "https://www.gnu.org/software/gawk/manual/html_node/Auto_002dset.html#Auto_002dset\n", "\n", "PROCINFO[\"identifiers\"]\n", "A subarray, indexed by the names of all identifiers used in the text of the awk program. An identifier is simply the name of a variable (be it scalar or array), built-in function, user-defined function, or extension function. \n", "\n", "PROCINFO[\"pgrpid\"]\n", "The process group ID of the current process.\n", "\n", "PROCINFO[\"pid\"]\n", "The process ID of the current process.\n", "\n", "PROCINFO[\"ppid\"]\n", "The parent process ID of the current process.\n", "\n", "PROCINFO[\"strftime\"]\n", "The default time format string for strftime(). Assigning a new value to this element changes the default. See Time Functions.\n", "\n", "PROCINFO[\"uid\"]\n", "The value of the getuid() system call.\n", "\n", "PROCINFO[\"version\"]\n", "The version of gawk.\n", "\n", "The following additional elements in the array are available to provide information about the MPFR and GMP libraries if your version of gawk supports arbitrary-precision arithmetic (see Arbitrary Precision Arithmetic):\n", "\n", "PROCINFO[\"gmp_version\"]\n", "The version of the GNU MP library.\n", "\n", "PROCINFO[\"mpfr_version\"]\n", "The version of the GNU MPFR library.\n", "\n", "PROCINFO[\"prec_max\"]\n", "The maximum precision supported by MPFR.\n", "\n", "PROCINFO[\"prec_min\"]\n", "The minimum precision required by MPFR.\n", "\n", "The following additional elements in the array are available to provide information about the version of the extension API, if your version of gawk supports dynamic loading of extension functions (see Dynamic Extensions):\n", "\n", "PROCINFO[\"api_major\"]\n", "The major version of the extension API.\n", "\n", "PROCINFO[\"api_minor\"]\n", "The minor version of the extension API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RLENGTH\n", "The length of the substring matched by the match() function (see String Functions). RLENGTH is set by invoking the match() function. Its value is the length of the matched string, or -1 if no match is found.\n", "\n", "### RSTART\n", "The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.\n", "\n", "### RT\n", "The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.\n", "\n", "### SYMTAB\n", "An array whose indices are the names of all defined global variables and arrays in the program. SYMTAB makes gawk’s symbol table visible to the awk programmer. It is built as gawk parses the program and is complete before the program starts to run." 
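, "\n", "\n", "A small sketch (not executed here) of RSTART and RLENGTH from the list above: match() sets both, and substr() can use them to pull out the matched text:\n", "\n", "```\n", "awk 'BEGIN { if (match(\"AWK tutorial\", /tut[a-z]+/)) print RSTART, RLENGTH, substr(\"AWK tutorial\", RSTART, RLENGTH) }'\n", "```"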
] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "awk 'BEGIN { foo = 5; SYMTAB[\"foo\"] = 4; print foo }' " ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "# You may use an index for SYMTAB that is not a predefined identifier:\n", "\n", "awk 'BEGIN { SYMTAB[\"xxx\"] = 5\n", "print SYMTAB[\"xxx\"] }'\n" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The answer is 42\n" ] } ], "source": [ "awk '\n", "# Indirect multiply of any variable by amount, return result\n", "\n", "function multiply(variable, amount)\n", "{\n", " return SYMTAB[variable] *= amount\n", "}\n", "\n", "BEGIN {\n", " answer = 10.5\n", " multiply(\"answer\", 4)\n", " print \"The answer is\", answer\n", "}\n", "'" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "17\n", "18\n", "19\n" ] } ], "source": [ "# changing NR\n", "\n", "echo 'a\n", "b\n", "c\n", "d' | awk 'NR == 2 { NR = 17 };\n", " { print NR }'" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }