Serve MDN sitemaps, other media files #445

Closed
jwhitlock opened this issue Aug 23, 2017 · 20 comments

@jwhitlock
Contributor

The sitemaps are initiated by celerybeat, generated in background tasks, and written to MEDIA_ROOT. They are low traffic, but users include search crawlers, so they are important for SEO.
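As a rough illustration of the "written to MEDIA_ROOT" step, here is a minimal sketch that writes a sitemap index into a media directory using only the standard library. The function name `write_sitemap_index`, the base URL argument, and the `sitemaps/<locale>/sitemap.xml` layout are assumptions made for this example, not taken from the Kuma code, and the celerybeat scheduling is left out.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def write_sitemap_index(media_root, base_url, locales):
    """Write sitemap.xml under media_root, with one entry per locale.

    Hypothetical helper: the sitemaps/<locale>/sitemap.xml layout is assumed.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for locale in locales:
        entry = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(entry, "loc")
        loc.text = f"{base_url}/sitemaps/{locale}/sitemap.xml"

    path = os.path.join(media_root, "sitemap.xml")
    # Write to a temporary file in the same directory, then rename it,
    # so a reader never sees a half-written index.
    fd, tmp_path = tempfile.mkstemp(dir=media_root, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        ET.ElementTree(root).write(handle, encoding="utf-8", xml_declaration=True)
    os.replace(tmp_path, path)
    return path
```

Calling `write_sitemap_index("/path/to/media", "https://example.test", ["en-US", "fr"])` would produce an index with two `<sitemap>` entries; the path and URL here are placeholders.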

@jwhitlock jwhitlock added this to the MDN AWS hosting milestone Aug 23, 2017
@bookshelfdave
Contributor

What needs to be done on the infra side of things? We already have an EFS volume mounted at /mdn in kuma containers. Multiple pods writing the same content may end up being problematic.

@jwhitlock
Contributor Author

Maybe nothing, but I'm not familiar with what you and @escattone did for attachments.

In SCL3, MEDIA_ROOT is the /media folder in the kuma source tree, which uses soft links to map to NFS folders and files:

  • media
    • attachments/ -> NFS/attachments/
    • sitemaps/ -> NFS/sitemaps/
    • sitemap.xml -> NFS/sitemap.xml
    • humans.txt -> NFS/humans.txt
    • revision.txt (Created by deployments)
    • kumascript-revision.txt (Created by deployment process)
    • robots.txt / robots-go-away.txt (Chosen by Apache config)

Looking at the settings, you are pointing MEDIA_ROOT at the attachments share. If that share has an attachments subfolder (/mdn/www/attachments/attachments), then there is probably nothing to do: the sitemap will be written to /mdn/www/attachments/sitemap.xml and be happy. If instead the attachments themselves are stored at the top level of that share, we'll need another process or share for the sitemap files.

@bookshelfdave
Contributor

The current directory structure for EFS is as follows:

./mdn/
|-- test
|   |-- attachments
|   |-- diagrams
|   |-- presentations
|   `-- samples
`-- www
    |-- attachments
    |   |-- 2012
    |   |-- 2013
    |   |-- 2014
    |   |-- 2015
    |   |-- 2016
    |   `-- 2017
    |-- diagrams
    |   `-- workflow
    |-- presentations
    |   |-- eich-ajax-experience-2007
    |   |-- eich-media-ajax-2007
    |   |-- javascript2
    |   |-- microsummaries
    |   |-- old-javascript2
    |   |-- oscon2005
    |   |-- screencasts
    |   |-- seneca
    |   |-- sxsw2007
    |   |-- xtech2005
    |   `-- xtech2006
    `-- samples
        |-- StockTicker
        |-- audio
        |-- browser-logos
        |-- canvas-tutorial
        |-- cssref
        |-- domref
        |-- dragdrop
        |-- extension-samples
        |-- filestash
        |-- html
        |-- js-animation
        |-- raycaster
        |-- rich-text-editor
        |-- soap
        |-- sse
        |-- svg
        |-- video
        |-- webgl
        |-- workers
        |-- xpcom
        |-- xulrunner
        |-- xultemp
        `-- xultu

I could update the current sync cronjob from SCL3 to copy the additional files into the correct locations.

@jwhitlock jwhitlock changed the title Serve MDN sitemaps Serve MDN sitemaps, other media files Aug 23, 2017
@jwhitlock
Contributor Author

I think there are just a few additional files we need to serve:

  • sitemap.xml (generated)
  • sitemaps/* (generated)
  • humans.txt (generated)
  • revision.txt (commit hash at deploy)
  • kumascript-revision.txt (commit hash at deploy)

I think eventually we'll want to convert these to Django views, maybe with caching. I want to talk to @escattone to see if he agrees and thinks it's a day or less of work. If that's the case, it's a code task, and maybe an extra step in Docker image provisioning for the commit hashes.

If not, we may want to host these files in a new folder, /mdn/www/media or similar, to separate them from the attachments.

@escattone
Contributor

escattone commented Aug 29, 2017

Thanks @jwhitlock! I hadn't yet considered these files. Here are my thoughts so far:

  • The setting that I used for KUMA_MEDIA_ROOT in my dev work was /mdn/www, but I see that I have not yet updated the default setting (in the Makefile) or the stage and prod settings. That's my first action item related to this: update KUMA_MEDIA_ROOT to /mdn/www in all places.
  • In terms of saving the sitemap.xml, sitemaps/*, and humans.txt files, we're good to go as they'll be generated and saved immediately under /mdn/www (KUMA_MEDIA_ROOT). However, in terms of serving those files, I'll add Django views for each of those (in SCL3 Apache currently rewrites those requests to /media/... on the way in, so I won't worry about these new Django views being "live" in SCL3).
  • I'll generate revision.txt within the Kuma Dockerfile, save it in /app/media, and create a Django view to serve /revision.txt with /app/media/revision.txt.
  • Hmm, I'm not sure about the best way to handle kumascript-revision.txt. If I generate it within the Kumascript Dockerfile, which seems best, I'm not sure how to serve it from the web service, since the kumascript submodule in the web service's kuma-based container may not match the kumascript service's kumascript-based container, right? Should we serve it from the kumascript service, redirecting the request there from the web service, or have the web service serve it by getting the value from the kumascript service (adding an endpoint to the kumascript service)?
  • I'll create a Django view for /robots.txt that serves either /app/media/robots.txt or /app/media/robots-go-away.txt, depending on whether we're production or staging respectively (again, I won't worry about this view being deployed "live" in SCL3, as Apache will handle it before Django would ever see the request)
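The selection logic in the last bullet could be sketched as follows. This is only an illustration under stated assumptions: the helper names `robots_path` and `read_robots` and the `is_production` flag are hypothetical, and this is not Kuma's actual view. A Django view could return the resulting text with a `text/plain` content type.

```python
import os


def robots_path(media_root, is_production):
    """Pick the robots file: permissive in production, restrictive otherwise."""
    name = "robots.txt" if is_production else "robots-go-away.txt"
    return os.path.join(media_root, name)


def read_robots(media_root, is_production):
    """Return the chosen file's contents as text."""
    with open(robots_path(media_root, is_production), encoding="utf-8") as handle:
        return handle.read()
```

Keeping the file choice in a small pure function makes it easy to test without a running web server.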

@escattone
Contributor

Submitted #455 to address KUMA_MEDIA_ROOT.

@escattone
Contributor

It looks like humans.txt is already handled (HUMANSTXT_ROOT is set to the value of MEDIA_ROOT by default).

@escattone
Contributor

After further reflection, I'm thinking the best way to handle the revision.txt and kumascript-revision.txt files might be via a pod.beta.kubernetes.io/init-containers annotation in both the kuma and kumascript k8s deployments, where I'll generate each file and then copy it to KUMA_MEDIA_ROOT. Sorry, I don't know why that didn't come to me yesterday!

@escattone
Contributor

I'm going to use k8s' new initContainers feature available since 1.6 (which replaces the pod.beta.kubernetes.io/init-containers annotation), which seems like the perfect place to do the revision stuff. Excited to use the new feature! (small minds are easily excited! 😄 )

@escattone
Contributor

Ok, after some work, I see now that initContainers, although an awesome feature, isn't a good fit for this since it would be run for each pod, and we want to generate the revision files and perform the database migrations only once per deployment. I suspect a K8s Job is the way to go.

@escattone
Contributor

escattone commented Aug 30, 2017

I think what we really want are "initContainers" that are run per deployment, not per pod. Kubernetes doesn't currently provide such a thing, but I found two related and very interesting discussions focused on this very issue:

@escattone
Contributor

escattone commented Aug 30, 2017

Actually, what I think we want are initContainers or jobs that can be triggered on deployment-based lifecycle events. For example, it would be great to run db migrations at the beginning of a k8s deployment, and the ability to reverse the migrations on failure or rollback events.

@jwhitlock
Contributor Author

How about a Jenkins step that generates the file and docker cps it into the image before uploading to quay.io?
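A build step along these lines could be sketched as below. The function names, the default file name, and the idea of calling `git rev-parse HEAD` from Python are assumptions for illustration; the actual Jenkins pipeline may do this differently, and the `docker cp` step is not shown.

```python
import os
import subprocess


def current_commit(repo_dir="."):
    """Return the full commit hash of HEAD in repo_dir (requires git on PATH)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def write_revision_file(media_dir, commit_hash, filename="revision.txt"):
    """Write the commit hash to media_dir/filename and return the path."""
    os.makedirs(media_dir, exist_ok=True)
    path = os.path.join(media_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(commit_hash + "\n")
    return path
```

For example, `write_revision_file("media", current_commit())` would record the checked-out commit; passing `filename="kumascript-revision.txt"` would cover the second file.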

@jgmize
Contributor

jgmize commented Aug 31, 2017

> Actually, what I think we want are initContainers or jobs that can be triggered on deployment-based lifecycle events. For example, it would be great to run db migrations at the beginning of a k8s deployment, and the ability to reverse the migrations on failure or rollback events.

I see from your comments above this one that you don't actually mean initContainers, since you're already aware they run for every pod creation and therefore wouldn't be a good fit for db migrations, but I'd avoid using that specific term to avoid confusion in the future. Jobs might seem like a good fit at first, but there are some caveats that need to be addressed for the specific use case of db migrations. For generating files, @jwhitlock's suggestion to do it in Jenkins at container build time makes the most sense to me.

@escattone
Contributor

escattone commented Aug 31, 2017

Sorry, I haven't been clear enough about this. I totally agree that generating the revision files is best done in Jenkins when building the kuma and kumascript Docker images. It seems to me the problem is how to serve the kumascript-revision.txt file from the web deployment pods. Unless we (1) also generate the kumascript-revision.txt file within the kuma Docker image (from within its kumascript submodule) or (2) copy the file from the kumascript container into the shared EFS mount, the web deployment pods won't have access to kumascript-revision.txt to serve it. If we choose (1), we would have to ensure that the kumascript submodule within the kuma Docker image always stays in sync with the kumascript Docker image, right? I assumed that we weren't committing to that, but maybe I'm wrong? If we choose (2) (which is the path I've been going down), we need to have a once-per-deployment step (I'm using a Job) that copies the kumascript-revision.txt file to the shared EFS mount so the pods in the web deployment have access to the file.

As for the database migrations, I'm also using a once-per-deployment k8s Job, but I'd like to ensure that it has completed before the kuma-based deployment pods start, and I'm not sure what the best/recommended way to do that is. I'm guessing it may be to do something like kubectl get job -o json | jq ... to get the status of the Job, and use an initContainer in the kuma-based deployments that waits for the Job to complete?
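The "wait for the Job" idea could be sketched in Python rather than jq, as below. It assumes the Job object has already been fetched and parsed from JSON, and it relies on the `status.conditions` entries of type `Complete` and `Failed` that Kubernetes sets on finished Jobs. The names `job_state`, `wait_for_job`, and `fetch_job` are hypothetical, and this is a sketch of one option, not a recommended deployment pattern.

```python
import time


def job_state(job):
    """Classify a parsed Job object as 'complete', 'failed', or 'running'."""
    conditions = job.get("status", {}).get("conditions", []) or []
    for cond in conditions:
        if cond.get("status") == "True":
            if cond.get("type") == "Complete":
                return "complete"
            if cond.get("type") == "Failed":
                return "failed"
    return "running"


def wait_for_job(fetch_job, timeout=300.0, interval=5.0):
    """Poll fetch_job() until the Job finishes or the timeout expires.

    fetch_job is any callable returning the Job as a dict (an assumption
    of this sketch); returns 'complete', 'failed', or 'timeout'.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = job_state(fetch_job())
        if state != "running":
            return state
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(interval)
```

`fetch_job` could, for example, run `kubectl get job <name> -o json` through `subprocess.run` and pass the output to `json.loads`; the Job name is left as a placeholder here.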

@escattone
Contributor

@jwhitlock and I met to discuss the issue of how best to serve the kumascript revision from the kuma-image-based web deployment pods, and decided the following would be the most solid (least error-prone) approach:

  • generate kumascript-revision.txt in Jenkins and store it in the kumascript Docker image when building/pushing (but before doing this work, review and merge mdn/kumascript#283, "bug 1340342: Build and test kumascript as a submodule of kuma")
  • add a new endpoint to kumascript that serves kumascript-revision.txt
  • add a new Django endpoint for serving kumascript-revision.txt which in turn serves the value it requests and receives from the kumascript service
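The third bullet, where the web service asks the kumascript service for its revision and relays the value, could be sketched like this. The function name and the URL passed in are assumptions for illustration; the real endpoint path and the Django view wiring are in the PRs, not shown here.

```python
import urllib.request


def fetch_kumascript_revision(service_url, timeout=2.0):
    """Fetch the revision text from the kumascript service.

    Returns the stripped response body, or None if the service cannot
    be reached or answers with an HTTP error (both raise OSError subclasses).
    """
    try:
        with urllib.request.urlopen(service_url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        return None
```

A Django view built on this would return the fetched text as `text/plain`, and could answer with an error status when the helper returns `None`, so that an unavailable kumascript service does not break the web service.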

@escattone
Contributor

Submitted mdn/kuma#4399 and mdn/kumascript#303 for review to address the first two bullet points.

@escattone
Contributor

Submitted mdn/kuma#4401 to address the final bullet point in #445 (comment) as well as the second bullet point in #445 (comment). I think these PRs fully address this issue, since there is already a Django view for serving robots.txt.

@bookshelfdave
Contributor

@escattone can this issue be closed?

@escattone
Contributor

@metadave Yes it can!
