Using Pandoc Lua filters to input and romanize Arabic in Rmarkdown documents
بسم الله الرحمن الرحيم
الحمد لله والصلاة والسلام على نبينا محمد
أشهد أن لا إله إلا الله وأشهد أن محمدا عبده ورسوله
أما بعد:
1 Abstract
Arabic romanization schemes often use diacritics, which are inconvenient to type on a standard US English keyboard. It is possible to enable international keyboard settings or to type Unicode characters directly, but this too is cumbersome. Here we present a method to type the romanization in plain ASCII (similar to the Buckwalter scheme) in an Rmarkdown document, and to use Pandoc Lua filters under the hood to render it with the diacritics.
2 Introduction
Rmarkdown is a syntax for writing rich text documents in plain text, which can then be converted to HTML pages or PDF documents.
3 Supporting Arabic text in Rmarkdown
If our article is in English and we would like to add inline Arabic text, we can enter it within a span element. For example, the Rmarkdown input:
This is some English text with some inline Arabic text [أهلا وسهلا]{lang="ar" dir="rtl"}
is rendered as:
This is some English text with some inline Arabic text أهلا وسهلا
We can simplify this by adding a Pandoc Lua filter, so that instead of typing:
[أهلا وسهلا]{lang="ar" dir="rtl"}
we can type:
[أهلا وسهلا]{.ar}
to get the same output:
أهلا وسهلا
To do this, create a file named myfilter.lua and add the following code to it:
-- Expand the shorthand class .ar into explicit lang and dir attributes.
function Span (elem)
  if elem.classes[1] == 'ar' then
    local attrs = pandoc.Attr("", {}, {{"lang", "ar"}, {"dir", "rtl"}})
    return pandoc.Span(elem.content, attrs)
  else
    return elem
  end
end
Then pass the filter to Pandoc by adding the following line under the output format (for example html_document) in the YAML header of your Rmarkdown document:
pandoc_args: ["--lua-filter=myfilter.lua"]
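As a sketch of where this line sits, assuming the document is rendered to HTML with html_document (the title is only a placeholder), the YAML header could look like this:

```yaml
---
title: "My document"
output:
  html_document:
    pandoc_args: ["--lua-filter=myfilter.lua"]
---
```

This assumes myfilter.lua is in the same directory as the .Rmd file.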
4 Romanization
Various romanization schemes are described here: https://en.wikipedia.org/wiki/Romanization_of_Arabic
To avoid typing all the diacritics directly, we can use Pandoc Lua filters again. First we define a mapping from each Arabic character to the ASCII character we type and to the Latin character that is rendered. This mapping can be customized to one's own preference.
The mapping I use, together with the Pandoc Lua filter code, is shown below:
function RomanizeMapping(text2)
  -- use digraphs sh, th, etc for some characters
  local digraph_en = false
  -- lower case mapping
  local mylcase = {}
  mylcase["E"] = "ʾ" -- hamza
  mylcase["A"] = "ā"
  mylcase["v"] = "ṯ" -- thaa
  mylcase["j"] = "j" -- jeem
  mylcase["H"] = "ḥ"
  mylcase["x"] = "ḵ" -- Khaa
  mylcase["p"] = "ḏ" -- dhal
  mylcase["c"] = "š" -- sheen
  mylcase["S"] = "ṣ"
  mylcase["D"] = "ḍ"
  mylcase["T"] = "ṭ"
  mylcase["P"] = "ḏ̣" -- DHaa
  mylcase["e"] = "ɛ" -- 3ayn
  mylcase["g"] = "ġ" -- ghayn
  mylcase["o"] = "ḧ" -- for taa marbuta in pausa non-construct
  mylcase["O"] = "ẗ" -- for taa marbuta in pausa construct
  mylcase["I"] = "ī"
  mylcase["U"] = "ū"
  mylcase["="] = "·" -- to insert middot explicitly. middot is automatically inserted before 'h' if digraph_en=true
  -- upper case mapping. use hash '#' before desired uppercase character
  local myucase = {}
  myucase["E"] = "ʾ"
  myucase["A"] = "Ā"
  myucase["v"] = "Ṯ"
  myucase["j"] = "J"
  myucase["H"] = "Ḥ"
  myucase["x"] = "Ḵ"
  myucase["p"] = "Ḏ"
  myucase["c"] = "Š"
  myucase["S"] = "Ṣ"
  myucase["D"] = "Ḍ"
  myucase["T"] = "Ṭ"
  myucase["P"] = "Ḏ̣"
  myucase["e"] = "Ɛ"
  myucase["g"] = "Ġ"
  myucase["I"] = "Ī"
  myucase["U"] = "Ū"
  myucase["a"] = "A"
  myucase["i"] = "I"
  myucase["u"] = "U"
  myucase["b"] = "B"
  myucase["t"] = "T"
  myucase["d"] = "D"
  myucase["r"] = "R"
  myucase["z"] = "Z"
  myucase["s"] = "S"
  myucase["f"] = "F"
  myucase["q"] = "Q"
  myucase["k"] = "K"
  myucase["l"] = "L"
  myucase["m"] = "M"
  myucase["n"] = "N"
  myucase["h"] = "H"
  myucase["w"] = "W"
  myucase["y"] = "Y"
  -- digraph variants override the single-character forms
  if digraph_en then
    mylcase["v"] = "t͟h"
    myucase["v"] = "T͟h"
    mylcase["c"] = "s͟h"
    myucase["c"] = "S͟h"
    mylcase["x"] = "k͟h"
    myucase["x"] = "K͟h"
    mylcase["g"] = "g͟h"
    myucase["g"] = "G͟h"
    mylcase["p"] = "d͟h"
    myucase["p"] = "D͟h"
    mylcase["P"] = "d͟͏̣h"
    myucase["P"] = "D͟͏̣h"
  end
  local text3 = ''
  local caps = false
  local prev_charv = ''
  -- the input is plain ASCII, so it can be processed one byte at a time
  for index3 = 1, #text2 do
    local charv = text2:sub(index3, index3)
    if charv == "#" then
      -- '#' only marks the next character as uppercase
      caps = true
    elseif caps then
      if myucase[charv] == nil then
        text3 = text3 .. charv
      else
        text3 = text3 .. myucase[charv]
      end
      caps = false
    else
      -- separate a real 'h' from a preceding letter that could be read as a digraph
      if digraph_en and charv == 'h' and prev_charv ~= '=' and
         (prev_charv == 't' or prev_charv == 's' or prev_charv == 'k' or
          prev_charv == 'd' or prev_charv == 'p' or prev_charv == 'P' or
          prev_charv == 'D' or prev_charv == 'c' or prev_charv == 'v' or
          prev_charv == 'x' or prev_charv == 'g') then
        text3 = text3 .. "·"
      end
      if mylcase[charv] == nil then
        text3 = text3 .. charv
      else
        text3 = text3 .. mylcase[charv]
      end
    end
    prev_charv = charv
  end
  return text3
end
-- Romanize every string field of every inline element in the span.
function Romanize (elem)
  for index, text in pairs(elem.content) do
    for index2, text2 in pairs(text) do
      -- skip non-string fields (for example nested content)
      if type(text2) == "string" then
        text[index2] = RomanizeMapping(text2)
      end
    end
    elem.content[index] = text
  end
  return elem.content
end
function Span (elem)
  if elem.classes[1] == 'trn' then
    -- romanized and italicized
    return pandoc.Emph(Romanize(elem))
  elseif elem.classes[1] == 'trn2' then
    -- romanized, no italics
    return Romanize(elem)
  elseif elem.classes[1] == 'ar' then
    local attrs = pandoc.Attr("", {}, {{"lang", "ar"}, {"dir", "rtl"}})
    return pandoc.Span(elem.content, attrs)
  else
    return elem
  end
end
Now we may give the following input:
[Ealeilmu fi -SSigari kannaqci fi -lHajar]{.trn}
and it will be rendered thus:
ʾalɛilmu fi -ṣṣiġari kannaqši fi -lḥajar
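To try the mapping outside Pandoc, here is a minimal standalone Lua sketch. It assumes a trimmed-down table with only the six keys this sentence needs, and the names lower and romanize_plain are made up for illustration. It ignores the # prefix and the digraph option.

```lua
-- Hypothetical, trimmed-down copy of the lowercase mapping:
-- only the keys needed for the example sentence.
local lower = {
  E = "ʾ", -- hamza
  e = "ɛ", -- 3ayn
  S = "ṣ",
  g = "ġ", -- ghayn
  c = "š", -- sheen
  H = "ḥ",
}

-- Replace every input byte that has a mapping; gsub keeps the original
-- character whenever the callback returns nil.
local function romanize_plain(input)
  return (input:gsub(".", function(ch) return lower[ch] end))
end

print(romanize_plain("Ealeilmu fi -SSigari kannaqci fi -lHajar"))
-- prints: ʾalɛilmu fi -ṣṣiġari kannaqši fi -lḥajar
```

Processing one byte at a time is safe here only because the input is plain ASCII.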
Note that the filter has a variable digraph_en, which I have set to false. Setting it to true changes the rendering of the following characters:
Arabic | Latin (digraph_en=false) | Latin (digraph_en=true) |
---|---|---|
ث | ṯ | t͟h |
خ | ḵ | k͟h |
ذ | ḏ | d͟h |
ش | š | s͟h |
ظ | ḏ̣ | d͟͏̣h |
غ | ġ | g͟h |
If digraph_en = true is chosen, the filter automatically inserts a middot (·) before h wherever the h could otherwise be misread as the second half of a digraph.
For example, the Arabic text:
يَسْهُلُ، يَتْرُكْهُ، مَشْهَد، مَذْهَب، يُبْغِضْهُ، مَظْهَر، يَبْعَثْهُ، يُؤَرِّخْهُ، يُبْعِدْهُ، يُبْلِغْهُ
is entered for romanization as:
[yashulu, yatrukhu, machad, maphab, yubgiDhu, maPhar, yabeavhu, yuEarrixhu, yubeidhu, yublighu]{.trn}
and is output as:
digraph_en = true: yas·hulu, yatruk·hu, mas͟h·had, mad͟h·hab, yubg͟hiḍ·hu, mad͟͏̣h·har, yabɛat͟h·hu, yuʾarrik͟h·hu, yubɛid·hu, yublig͟h·hu
digraph_en = false: yashulu, yatrukhu, mašhad, maḏhab, yubġiḍhu, maḏ̣har, yabɛaṯhu, yuʾarriḵhu, yubɛidhu, yubliġhu
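The middot rule can be isolated in a few lines. The sketch below is a simplified restatement, not the filter itself: it assumes a lookup set (named before_h here for illustration) in place of the long chain of comparisons, and it leaves the letters untranslated so that only the placement of the middot is visible.

```lua
-- Input letters after which a following 'h' would be ambiguous.
-- The set mirrors the comparisons in the filter; the name is hypothetical.
local before_h = {
  t = true, s = true, k = true, d = true, p = true, P = true,
  D = true, c = true, v = true, x = true, g = true,
}

-- Copy the input, adding a middot before each ambiguous 'h'.
local function insert_middots(input)
  local out, prev = {}, ""
  for i = 1, #input do
    local ch = input:sub(i, i)
    if ch == "h" and before_h[prev] then
      out[#out + 1] = "·"
    end
    out[#out + 1] = ch
    prev = ch
  end
  return table.concat(out)
end

print(insert_middots("yashulu, yatrukhu"))
-- prints: yas·hulu, yatruk·hu
```

An h at the start of a word, or after a vowel, is left alone because the preceding character is not in the set.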
We also have an option to romanize text without putting it into italics (by using .trn2) and to produce uppercase characters (by typing # before the character). This can be useful for proper nouns:
Input:
[#zayd ibn E#abI #eamr]{.trn2}
Output:
Zayd ibn ʾAbī Ɛamr
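The # prefix can also be sketched in standalone Lua. The tables below are a hypothetical subset covering only this example, and romanize_caps is an illustrative name. Unlike the filter, this single-pattern version does not swallow a stray # at the very end of the input.

```lua
-- Hypothetical subsets of the two mappings, just enough for the example.
local lower = { E = "ʾ", I = "ī" }
local upper = { z = "Z", a = "A", e = "Ɛ" }

-- Match an optional '#' followed by one character, then look the
-- character up in the uppercase or lowercase table accordingly.
local function romanize_caps(input)
  return (input:gsub("(#?)(.)", function(hash, ch)
    local map = (hash == "#") and upper or lower
    return map[ch] or ch
  end))
end

print(romanize_caps("#zayd ibn E#abI #eamr"))
-- prints: Zayd ibn ʾAbī Ɛamr
```

Handling the prefix and the character in one match keeps an uppercased letter such as A from being fed through the lowercase table a second time.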
5 More details
For more details you may browse the code here: https://github.com/adamiturabi/rmd-arabic-romaniz