reference page done
goatchurchprime committed Feb 16, 2017
1 parent 4ace984 commit dcbe0e5
Showing 1 changed file with 215 additions and 21 deletions.
236 changes: 215 additions & 21 deletions databaker/tutorial/tutorial_reference.ipynb
@@ -745,38 +745,232 @@
"source": [
"# Downloading excel and unzipping files\n",
"\n",
"Can be done if you use the libraries"
"If you are doing work on a computer that can actually be done by the computer, then you are not doing real work. \n",
"\n",
"Please automate the webscraping and unzipping of files. \n",
"\n",
"Here are some quick methods for downloading multiple excel spreadsheets linked to from [this page](https://www.ons.gov.uk/businessindustryandtrade/constructionindustry/datasets/outputintheconstructionindustry/current).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"metadata": {
"collapsed": true
"collapsed": false
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloaded a webpage with 31071 bytes\n"
]
}
],
"source": [
"# code from Bau_construct that needs tuning down\n",
"import urllib, re, os\n",
"\n",
" ddurl = \"https://www.ons.gov.uk/businessindustryandtrade/constructionindustry/datasets/outputintheconstructionindustry/current\"\n",
" print(\"Loading index page\\n\", ddurl)\n",
" req1 = urllib.request.Request(ddurl, headers={'User-Agent' : \"Sensible code\"}) \n",
" xpage = urllib.request.urlopen(req1).read().decode(\"utf8\")\n",
"# url containing the index of a set of spreadsheets\n",
"ddurl = \"https://www.ons.gov.uk/businessindustryandtrade/constructionindustry/datasets/outputintheconstructionindustry/current\"\n",
"req1 = urllib.request.Request(ddurl, headers={'User-Agent' : \"Sensible code\"}) \n",
"dhtml = urllib.request.urlopen(req1).read().decode(\"utf8\")\n",
"print(\"Downloaded a webpage with\", len(dhtml), \"bytes\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# make the download directory\n",
"dfiles = \"downloaddir\"\n",
"if not os.path.isdir(dfiles):\n",
" print(\"making directory\", dfiles)\n",
" os.mkdir(dfiles)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# quick and dirty regular expression for pullint out the links to relevant xls spreadsheets\n",
"xllinklist = re.findall('href=\"(/file\\?uri=/businessindustryandtrade.*?/([^/\"]*\\.xls))\"', dhtml)\n",
" \n",
" xll = re.findall('href=\"(/file\\?uri=/businessindustryandtrade.*?/([^/\"]*\\.xls))\"', xpage)\n",
"for xl, xln in xllinklist:\n",
" lxln = os.path.join(dfiles, xln)\n",
" if os.path.exists(lxln):\n",
" continue # <-- we avoid downloading the same file a second time, in this case\n",
" furl = urllib.parse.urljoin(ddurl, xl)\n",
" req = urllib.request.Request(furl, headers={'User-Agent' : \"Sensible code\"}) \n",
" xp = urllib.request.urlopen(req).read()\n",
" print(\"Downloading\", xln, len(xp), \"bytes\")\n",
" fout = open(lxln, \"wb\")\n",
" fout.write(xp)\n",
" fout.close()\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fnames = [ os.path.join(dfiles, f) for f in os.listdir(dfiles) if f[-4:] == '.xls' ]\n",
"\n",
"print(\"Your list of xls files is:\\n\", \"\\n \".join(fnames))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What to do when you have zip files\n",
"\n",
"If you find yourself downloading zipfiles and manually instructing the computer to unzip each file, you should think about making the computer do the work itself.\n",
"\n",
"An example of zipfiles containing excel spreadsheets can be found on [this page](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/workplacepensions/datasets/annualsurveyofhoursandearningspensiontablespensiontypebyagegroupandbygrossweeklyearningsbandsp1).\n",
"\n",
"First job is to download one of these files, as we did above:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We are about to download the file:\n",
" https://www.ons.gov.uk/file?uri=/employmentandlabourmarket/peopleinwork/workplacepensions/datasets/annualsurveyofhoursandearningspensiontablespensiontypebyagegroupandbygrossweeklyearningsbandsp1/2015/2015provisionaltablep1.zip\n",
"downloaded.zip is 44560 bytes long.\n"
]
}
],
"source": [
"import urllib, re\n",
"\n",
"# fetch the front page and find the link to the zip file we want\n",
"iurl = \"https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/workplacepensions/datasets/annualsurveyofhoursandearningspensiontablespensiontypebyagegroupandbygrossweeklyearningsbandsp1\"\n",
"req = urllib.request.Request(iurl, headers={'User-Agent' : \"Sensible Code\"}) \n",
"ipage = urllib.request.urlopen(req).read()\n",
"\n",
"# search the link to the zip file and \"join\" against the baseurl to get the full url (there's a space -> %20 bug problem)\n",
"zyears = [ urllib.parse.urljoin(iurl, z.replace(\" \", \"%20\")) for z in re.findall('<a href=\"([^\"]*?\\.zip)\"', str(ipage)) ]\n",
"zurl = zyears[0]\n",
"\n",
"print(\"We are about to download the file:\\n\", zurl)\n",
"zfilename = \"downloaded.zip\"\n",
"zurl = zurl.replace(\" \", \"%20\") # spaces in the url get escaped in the browser\n",
"req = urllib.request.Request(zurl, headers={'User-Agent' : \"Sensible Code\"}) \n",
"zbytes = urllib.request.urlopen(req).read()\n",
" \n",
" for xl, xln in xll:\n",
" lxln = os.path.join(xddir, xln)\n",
" if os.path.exists(lxln):\n",
" continue\n",
" print(\"Downloading\", xln)\n",
" furl = urllib.parse.urljoin(ddurl, xl)\n",
" req = urllib.request.Request(furl, headers={'User-Agent' : \"Sensible code\"}) \n",
" xp = urllib.request.urlopen(req).read()\n",
" fout = open(lxln, \"wb\")\n",
" fout.write(xp)\n",
" fout.close()\n"
"fout = open(zfilename, \"wb\")\n",
"fout.write(zbytes)\n",
"fout.close()\n",
"print(zfilename, \"is\", len(zbytes), \"bytes long.\") \n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The files in downloaded.zip are:\n",
" PROV - Pension Provision by Earnings & Age Group Table P1.1b Pension Type 2015 CV.xls\n",
" PROV - Pension Provision by Earnings & Age Group Table P1.1a Pension Type 2015.xls\n",
"\n",
"We have unzipped:\n",
" PROV - Pension Provision by Earnings & Age Group Table P1.1b Pension Type 2015 CV.xls \n",
"and saved it as downloaded0.xls with 83968 bytes\n"
]
}
],
"source": [
"import zipfile\n",
"\n",
"zfilename = \"downloaded.zip\"\n",
"\n",
"# open the zipfile\n",
"zdir = zipfile.ZipFile(zfilename)\n",
"\n",
"print(\"The files in\", zfilename, \"are:\\n\", \"\\n \".join(zdir.namelist()))\n",
"\n",
"zmember0 = zdir.namelist()[0]\n",
"\n",
"xlsfilename = \"downloaded0.xls\"\n",
"fout = open(xlsfilename, \"wb\")\n",
"xlsbindata = zdir.read(zmember0)\n",
"fout.write(xlsbindata) \n",
"fout.close()\n",
"\n",
"print()\n",
"print(\"We have unzipped:\\n\", zmember0, \"\\nand saved it as\", xlsfilename, \"with\", len(xlsbindata), \"bytes\")"
]
},
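{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want every spreadsheet in the archive rather than just the first member, `zipfile.ZipFile.extractall` unpacks them all in one call. This is only an optional sketch that reuses the `downloaddir` folder made earlier; the extracted files keep their original (long) member names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# optional: extract every member of the zip into the download directory in one go\n",
"import zipfile\n",
"\n",
"zdir = zipfile.ZipFile(\"downloaded.zip\")\n",
"zdir.extractall(path=dfiles)   # dfiles is the \"downloaddir\" folder created in an earlier cell\n",
"print(\"Extracted\", len(zdir.namelist()), \"files into\", dfiles)"
]
},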
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading downloaded0.xls which has size 83968 bytes\n",
"Table names: ['CV notes', 'All', 'Male', 'Female']\n"
]
}
],
"source": [
"# now we can load this file into databaker and continue with our work\n",
"from databaker.framework import *\n",
"tabs = loadxlstabs(xlsfilename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Automation Notes\n",
"\n",
"The processes for downloading, saving and extracting the excel files from places on the web might appear complex, but are in fact quite simple. They are broken down into nothing more than opening files, listing files, reading files and saving files. \n",
"\n",
"What is very complex is being organized at deciding where to copy the files, knowling what names they should have, and keeping track of what are the new files versus the ones that have already been seen or are even possibly being over-written. \n",
"\n",
"There's no obvious answer to this problem, because it depends on the consistency and form of the source pages -- which may not be as consistent as you'd like them to be. The first step, however, is to take this file management issue seriously and give it the design and thought that it requires. And be prepared to look at it again in the event that your first attempt at automation turns out to be more burdensome than necessary.\n",
"\n",
"Finally, using the same functions here of `urllib.request.Request` and so forth, it's possible for the code in a notebook to POST the processed results back into a webservice further down the pipeline so that no one needs to touch this script by hand. In this case it is important to handle the error generation and return any messages about the consistency of the input files to the place where the file was generated in order for fixes to happen as soon as possible while the file is live.\n"
]
},
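{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is only a minimal sketch of that final POST step. The endpoint url `https://example.org/pipeline/upload` and the filename `processed_output.csv` are made-up placeholders (they are not part of databaker or any real ONS service), so substitute the details of your own pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# minimal sketch: POST the processed results onward (hypothetical endpoint and filename)\n",
"import urllib.request\n",
"\n",
"posturl = \"https://example.org/pipeline/upload\"   # placeholder url for the downstream webservice\n",
"outfilename = \"processed_output.csv\"              # assumes an earlier cell has written this file\n",
"\n",
"csvbytes = open(outfilename, \"rb\").read()\n",
"req = urllib.request.Request(posturl, data=csvbytes,\n",
"                             headers={'User-Agent': \"Sensible Code\", 'Content-Type': \"text/csv\"})\n",
"response = urllib.request.urlopen(req)\n",
"print(\"The webservice replied with status\", response.getcode())"
]
},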
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
