CJK font hijacks smart quotes when compiling PDF with XeLaTeX #7509

Open
goderich opened this issue Aug 19, 2021 · 21 comments

@goderich

goderich commented Aug 19, 2021

Explain the problem.
When compiling a markdown document with mixed Latin and CJK text and using --pdf-engine=xelatex with smart quotes enabled, all single and double quotes become full-width CJK quotes. Basically the CJK font hijacks the smart quotes.

From my testing, this happens only with XeLaTeX, and only with smart quotes enabled. This did not happen last year (2020), when I was using pandoc extensively, but I discovered the bug a couple of months ago, and it still persists.

You don't even need actual CJK text in the file; simply declaring a CJK font produces the bug.

MWE:

---
CJKmainfont: Noto Sans CJK TC
---
It's a "bug"!

When compiling with pandoc --pdf-engine=xelatex -i bug.md -o bug.pdf, the single quote in it's and the double quotes around bug are noticeably wider (full width CJK quotes, from my experience with them). Compiling with other PDF engines or with --from markdown-smart does not produce these wide quote marks.

Pandoc version?
pandoc 2.14.0.2 on Arch Linux
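
The comparison described in this report can be scripted. The following is a minimal sketch as a POSIX shell function; the function name `repro_quotes` and the output file names are made up for illustration, while the pandoc flags are the ones quoted above. It assumes pandoc, xelatex, and the Noto Sans CJK TC font are installed.

```shell
#!/bin/sh
# Hypothetical helper: write the MWE from this report and build it twice,
# once with the default (smart) reader and once with smart disabled.
repro_quotes() {
  # Quoted heredoc delimiter so the shell leaves the content untouched.
  cat > bug.md <<'EOF'
---
CJKmainfont: Noto Sans CJK TC
---
It's a "bug"!
EOF
  # Smart punctuation on (the Markdown default): reported to show wide quotes.
  pandoc --pdf-engine=xelatex bug.md -o bug-smart.pdf || return 1
  # Smart punctuation off: reported not to show the wide quotes.
  pandoc --pdf-engine=xelatex --from markdown-smart bug.md -o bug-plain.pdf
}
```

Running `repro_quotes` in an empty directory and comparing `bug-smart.pdf` with `bug-plain.pdf` should show whether a given TeX installation is affected.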

goderich added the bug label Aug 19, 2021
@jgm
Owner

jgm commented Aug 19, 2021

Unless you can see a problem with the LaTeX output pandoc is producing, this looks more like a problem with xelatex than with pandoc. Can you post on the tex stack exchange to see if the experts there have any ideas what is causing this, or how to work around it?

@goderich
Author

goderich commented Aug 19, 2021

@jgm Thank you for the swift reply!

I just tried using pandoc to make a tex file and then manually running xelatex on it. This does not produce the bug.
The bug only happens when I try to create a PDF from a markdown file (I haven't tried other input though) with pandoc using the xelatex backend.

I think with this new information it looks like the problem might be with pandoc after all?
The whole smart quotes thing was weird, too. If it was xelatex, turning smart quotes off in pandoc shouldn't have had an effect.

@jgm
Owner

jgm commented Aug 19, 2021

In producing a PDF via pandoc, we disable the smart extension when creating the intermediate LaTeX file, to avoid bad ligatures like ?`. That's probably why you're seeing a difference. Try creating the LaTeX file using -t latex-smart.

@goderich
Author

Ah, yes, that does produce a tex file that gives full-width CJK quotes when compiled.

So this is fully on xelatex then?

@goderich
Author

goderich commented Aug 19, 2021

Looks like it's indeed xelatex, specifically the xeCJK package. I got a MWE for it:

\documentclass{article}
\usepackage{xeCJK}

\begin{document}
It’s a “bug”!
\end{document}

Compiling this with xelatex has the same buggy output with full-width CJK quotes.

@goderich
Author

goderich commented Aug 19, 2021

I found this on tex stack exchange: https://tex.stackexchange.com/questions/36878/xecjk-messes-with-punctuation

So according to the top answer, this is a feature. That's a problem for users like me, who need to use mixed English and CJK fonts.
Also, that answer is from 10 years ago, but I didn't experience this problem last year. So this appears to be a relatively recent change in pandoc?

@jgm
Owner

jgm commented Aug 19, 2021

Yes, disabling smart when producing the intermediate tex file is a relatively recent change.
You can of course work around this by generating the tex yourself and compiling.
(or try with lualatex?)

@goderich
Author

Is this a “wontfix” then?

@jgm
Owner

jgm commented Aug 19, 2021

The only possible fix I can think of is for the PDF module to check whether the metadata or variables include CJKmainfont and, if so, use smart in generating the intermediate tex. That reintroduces a risk of weird ligature collisions, but since these mostly come from settings for European languages, maybe it's okay.

@goderich
Author

Just tried lualatex and the bug is still there.

@goderich
Author

I'm not sure what ligatures you mean exactly. Is it stuff like {\"a}? Do people use those with xelatex? I thought the whole point of xelatex was to let people write unicode directly.

@jgm
Owner

jgm commented Aug 19, 2021

Ligatures like `` ... '' for double quotes, -- for n-dash, etc.
Yes, currently we do NOT use these with xelatex in producing a PDF -- that's why we disable smart for LaTeX.
But the stackexchange link above says that the recommended way to work around this issue is to use `` ... '' for quotes, instead of unicode curly quotes, which will automatically be interpreted as CJK.
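
To check the ligature-based workaround in isolation, a test document can be written and compiled directly. This is a rough sketch only: the function name `ligature_quote_test` and the file name are invented for illustration, and the document body is the pure-LaTeX MWE from earlier in this thread with the Unicode curly quotes swapped for TeX quote ligatures. It assumes xelatex and the xeCJK package are installed.

```shell
#!/bin/sh
# Hypothetical helper: write a xeCJK test document that uses TeX quote
# ligatures rather than Unicode curly quotes, then compile it.
ligature_quote_test() {
  # Quoted heredoc delimiter keeps the backticks and backslashes literal.
  cat > ligature-quotes.tex <<'EOF'
\documentclass{article}
\usepackage{xeCJK}

\begin{document}
% `` and '' are the TeX ligatures for double quotes; ' is the apostrophe.
It's a ``bug''!
\end{document}
EOF
  xelatex -interaction=nonstopmode ligature-quotes.tex
}
```

If the linked answer applies, the quotes in the resulting PDF should come from the Latin font rather than the CJK font.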

@goderich
Author

I'm sorry, but I didn't quite understand whether you plan to address this.

What would be the issue with using `` ... '' instead of curly quotes? (Do they introduce unwanted ligatures?) Can it be triggered only when using xeCJK, or CJKmainfont as you suggested?

@jgm
Owner

jgm commented Aug 20, 2021

I noted a possible change to pandoc above, along with a possible disadvantage. The reason we don't use the `` ligatures by default in generating PDFs is that the language support in babel/polyglossia tends to define language-specific ligatures (I can't remember them all, but stuff like ?`) that interact badly with these. (You can search this tracker for examples, e.g. #4695.)

So, if we use smart in generating the PDF when CJKmainfont is used, there's potential for issues of this kind if the western language used is one of those that define these ligatures.

Probably it's worth doing, which is why I haven't closed this.

@jgm
Owner

jgm commented Aug 20, 2021

But what you should do in the meantime is simply generate a standalone tex file (with -t latex+smart -s) and compile it yourself.
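
The two-step workaround can be sketched as a small POSIX shell function. The function name `md_to_pdf_via_tex` is hypothetical, and the sketch assumes the input file ends in `.md`; the pandoc options are the ones suggested above.

```shell
#!/bin/sh
# Hypothetical helper: Markdown -> standalone LaTeX (smart on) -> PDF.
md_to_pdf_via_tex() {
  src=$1
  base=${src%.md}
  # -s writes a complete document; latex+smart makes the LaTeX writer emit
  # quote ligatures (`` and '') instead of Unicode curly quotes.
  pandoc "$src" -s -t latex+smart -o "$base.tex" || return 1
  # Compile the generated file directly, bypassing pandoc's own PDF step.
  xelatex -interaction=nonstopmode "$base.tex"
}
```

For example, `md_to_pdf_via_tex bug.md` would write `bug.tex` and then `bug.pdf` in the current directory.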

@goderich
Author

OK, I understand now, thank you.

@jgm
Owner

jgm commented Sep 1, 2021

I'm not able to reproduce this with my tex setup.

Oddly, I can reproduce the issue with your pure latex case.
But when I use pandoc -o my.pdf --pdf-engine=xelatex and specify a CJKmainfont as in your example, the quotes look fine! I can't understand why. The intermediate tex file has curly unicode quotes, not ligatures.

@jgm
Owner

jgm commented Sep 1, 2021

Anyway, here's a patch that disables smart in producing LaTeX only if CJKmainfont isn't specified:

diff --git a/src/Text/Pandoc/PDF.hs b/src/Text/Pandoc/PDF.hs
index 9ff4bfb09..4c0514e34 100644
--- a/src/Text/Pandoc/PDF.hs
+++ b/src/Text/Pandoc/PDF.hs
@@ -24,7 +24,7 @@ import qualified Data.ByteString as BS
 import Data.ByteString.Lazy (ByteString)
 import qualified Data.ByteString.Lazy as BL
 import qualified Data.ByteString.Lazy.Char8 as BC
-import Data.Maybe (fromMaybe)
+import Data.Maybe (fromMaybe, isJust)
 import Data.Text (Text)
 import qualified Data.Text as T
 import qualified Data.Text.Lazy as TL
@@ -51,6 +51,7 @@ import Text.Pandoc.Shared (inDirectory, stringify, tshow)
 import qualified Text.Pandoc.UTF8 as UTF8
 import Text.Pandoc.Walk (walkM)
 import Text.Pandoc.Writers.Shared (getField, metaToContext)
+import Text.DocTemplates (lookupContext)
 import Control.Monad.Catch (MonadMask)
 #ifdef _WINDOWS
 import Data.List (intercalate)
@@ -97,10 +98,16 @@ makePDF program pdfargs writer opts doc =
 #else
         let tmpdir = tmpdir'
 #endif
-        doc' <- handleImages opts tmpdir doc
+        doc'@(Pandoc meta _) <- handleImages opts tmpdir doc
+        let cjk = -- see #7509, #7535
+                  isJust (lookupMeta "CJKmainfont" meta) ||
+                  isJust (lookupContext "CJKmainfont" (writerVariables opts)
+                            :: Maybe Text)
         source <- writer opts{ writerExtensions = -- disable use of quote
                                   -- ligatures to avoid bad ligatures like ?`
-                                  disableExtension Ext_smart
+                                  (if cjk
+                                      then id
+                                      else disableExtension Ext_smart)
                                    (writerExtensions opts) } doc'
         case baseProg of
           "context" -> context2pdf program pdfargs tmpdir source

Since I can't reproduce the issue yet, I'm a bit reluctant to apply this.

@goderich
Author

goderich commented Sep 2, 2021

> I'm not able to reproduce this with my tex setup.
>
> Oddly, I can reproduce the issue with your pure latex case.
> But when I use pandoc -o my.pdf --pdf-engine=xelatex and specify a CJKmainfont as in your example, the quotes look fine! I can't understand why. The intermediate tex file has curly unicode quotes, not ligatures.

Huh. Well that's weird. Thank you for looking into this.
I'm using a prepackaged binary on my distribution, and right now as a workaround I'm using a Makefile which generates latex+smart with pandoc, and then compiles the result using xelatex.

I'm afraid I don't have enough time at the moment to explore this further, but I should be able to help debug it in a couple of months. I'm running Arch Linux on both my machines, though, so I can't test on other systems.

@khemarato

khemarato commented May 27, 2022

I'm also seeing this bug. Frustratingly, I can't seem to compile my document using the lualatex engine either, as it can't find my CJK font (it's removing the spaces from the font name and then saying it can't find that?) 😞

For those reading this later, you can turn off the smart extension using the --from=markdown-smart flag

@oldjove

oldjove commented Jun 13, 2022

I also experienced this bug. After some searching around, I found this solution, whereby you reassign the class of the offending characters. I've added this code to my template and find that it's solved the problem:

\AtBeginDocument{%
\XeTeXcharclass`^^^^2026=0 \XeTeXcharclass`^^^^2019=0
\XeTeXcharclass`^^^^2013=0 \XeTeXcharclass`“=0
\XeTeXcharclass`”=0 \XeTeXcharclass`‘=0
}
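
For anyone who would rather not edit a template, the same idea can probably be passed to pandoc as a header include. This is a hedged sketch, not a tested recipe: the function name `pdf_with_quote_fix` and the file name `quote-fix.tex` are made up, and the backtick before each character is an assumption based on the later note in this thread that the snippet was badly formatted due to backticks. All six characters are written in `^^^^` notation here, which needs lowercase hex digits.

```shell
#!/bin/sh
# Hypothetical helper: store the character-class reset in a header file and
# hand it to pandoc with -H (--include-in-header).
pdf_with_quote_fix() {
  src=$1
  dest=$2
  # Quoted heredoc delimiter keeps backticks, carets and backslashes literal.
  cat > quote-fix.tex <<'EOF'
\AtBeginDocument{%
  \XeTeXcharclass`^^^^2026=0 % horizontal ellipsis
  \XeTeXcharclass`^^^^2013=0 % en dash
  \XeTeXcharclass`^^^^2018=0 % left single quote
  \XeTeXcharclass`^^^^2019=0 % right single quote / apostrophe
  \XeTeXcharclass`^^^^201c=0 % left double quote
  \XeTeXcharclass`^^^^201d=0 % right double quote
}
EOF
  pandoc --pdf-engine=xelatex -H quote-fix.tex "$src" -o "$dest"
}
```

A call such as `pdf_with_quote_fix bug.md bug.pdf` keeps smart punctuation enabled while asking XeTeX to treat these characters as ordinary (class 0) text.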

goderich added a commit to goderich/dotfiles that referenced this issue Dec 19, 2022
As per jgm/pandoc#7509 (comment)
(somewhat badly formatted due to backticks), the issue of CJK fonts
hijacking punctuation in xelatex that has been bothering me for a
couple of years can be mitigated.

I am applying the fix at the default file level, meaning that every file
I compile with `-dpdf` will have it. After a couple of tests, it seems
to be working great: PDFs get compiled nicely with no fighting between
CJK fonts and smart punctuation.

However, the smart punctuation functionality is disabled by default for
org-mode input and has to be enabled with `--from=org+smart` when
compiling from the command line. The upside of this is that I no longer
need to use backticks when writing quotes in org-mode for later PDF
generation.