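In the name of Allah, the Most Gracious, the Most Merciful.
All praise is due to Allah, and may peace and blessings be upon our Prophet Muhammad.
I bear witness that there is no deity except Allah, and I bear witness that Muhammad is His servant and messenger.
To proceed: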

1 Abstract

Arabic romanization schemes often use diacritics that are inconvenient to type on a standard US English keyboard. It is possible to enable international keyboard settings or to type Unicode characters directly, but this too is cumbersome. Here we present a method for typing the romanization in plain ASCII (similar to the Buckwalter scheme) in an Rmarkdown document, using Pandoc Lua filters under the hood to render it with the diacritics.

2 Introduction

Rmarkdown is a plain-text syntax for writing rich-text documents, which can then be converted to HTML pages or PDF documents.

3 Supporting Arabic text in Rmarkdown

If our article is in English and we would like to add inline Arabic text, we can enter it within a span element. For example, the Rmarkdown input:

This is some English text with some inline Arabic text [أهلا وسهلا]{lang="ar" dir="rtl"}

is rendered as:

This is some English text with some inline Arabic text أهلا وسهلا

We can simplify this by adding a Pandoc Lua filter so that instead of typing:

[أهلا وسهلا]{lang="ar" dir="rtl"}

we can type:

[أهلا وسهلا]{.ar}

to get the same output:

أهلا وسهلا

To do this, create a file named myfilter.lua and add the following code to it:

function Span (elem)
  -- spans whose first class is 'ar' get the Arabic language and direction attributes
  if elem.classes[1] == 'ar' then
    local attrs = pandoc.Attr("", {}, {{"lang", "ar"},{"dir","rtl"}})
    return pandoc.Span(elem.content, attrs)
  else
    return elem
  end
end

Then add the following line to the YAML header of your Rmarkdown document, nested under the options of the output format you are using (for example html_document):

    pandoc_args: ["--lua-filter=myfilter.lua"]
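
The filter above fires only when ar is the first class on the span, so a span carrying another class before it would be left alone. One possible variant, sketched here with a hypothetical helper named has_class (not part of the original filter), checks the whole class list instead. The helper is plain Lua; the Span function assumes it is run by Pandoc, which supplies the pandoc module.

```lua
-- Hypothetical helper: true if `name` appears anywhere in the class list.
function has_class(classes, name)
  for _, class in ipairs(classes) do
    if class == name then
      return true
    end
  end
  return false
end

-- Variant of the Span filter that accepts 'ar' in any position.
-- Only called by Pandoc; `pandoc` is the module Pandoc provides to filters.
function Span (elem)
  if has_class(elem.classes, 'ar') then
    local attrs = pandoc.Attr("", {}, {{"lang", "ar"}, {"dir", "rtl"}})
    return pandoc.Span(elem.content, attrs)
  end
  return elem
end
```

Like the original, this variant replaces the span's attributes rather than merging them, so any other classes on the span are dropped from the output.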

4 Romanization

There are various romanization schemes described here: https://en.wikipedia.org/wiki/Romanization_of_Arabic

To avoid typing all the diacritics directly, which can become cumbersome, we can use Pandoc Lua filters again. First we define a mapping from ASCII input characters to the Latin characters that should appear in the rendered output, one for each Arabic letter that needs a diacritic. This mapping can be customized to suit one's preferred scheme.

The mapping I use, together with the Pandoc Lua filter code, is shown below:

function RomanizeMapping(text2)
  -- set to true to use underlined digraphs (sh, th, etc.) for some characters
  local digraph_en = false

  -- lower case mapping
  local mylcase = {}
  mylcase["E"] = "ʾ" -- hamza
  mylcase["A"] = "ā"
  mylcase["v"] = "ṯ" -- thaa
  mylcase["j"] = "j" -- jeem
  mylcase["H"] = "ḥ"
  mylcase["x"] = "ḵ" -- Khaa
  mylcase["p"] = "ḏ" -- dhal
  mylcase["c"] = "š" -- sheen
  mylcase["S"] = "ṣ"
  mylcase["D"] = "ḍ"
  mylcase["T"] = "ṭ"
  mylcase["P"] = "ḏ̣" -- DHaa
  mylcase["e"] = "ɛ" -- 3ayn
  mylcase["g"] = "ġ" -- ghayn
  mylcase["o"] = "ḧ" -- for taa marbuta in pausa non-construct
  mylcase["O"] = "ẗ" -- for taa marbuta in pausa construct
  mylcase["I"] = "ī"
  mylcase["U"] = "ū"
  mylcase["="] = "·" -- to insert middot explicitly. middot is automatically inserted before 'h' if digraph_en=true

  -- upper case mapping. use hash '#' before desired uppercase character
  local myucase = {}
  myucase["E"] = "ʾ"
  myucase["A"] = "Ā"
  myucase["v"] = "Ṯ"
  myucase["j"] = "J"
  myucase["H"] = "Ḥ"
  myucase["x"] = "Ḵ"
  myucase["p"] = "Ḏ"
  myucase["c"] = "Š"
  myucase["S"] = "Ṣ"
  myucase["D"] = "Ḍ"
  myucase["T"] = "Ṭ"
  myucase["P"] = "Ḏ̣"
  myucase["e"] = "Ɛ"
  myucase["g"] = "Ġ"
  myucase["I"] = "Ī"
  myucase["U"] = "Ū"
  myucase["a"] = "A"
  myucase["i"] = "I"
  myucase["u"] = "U"
  myucase["b"] = "B"
  myucase["t"] = "T"
  myucase["d"] = "D"
  myucase["r"] = "R"
  myucase["z"] = "Z"
  myucase["s"] = "S"
  myucase["f"] = "F"
  myucase["q"] = "Q"
  myucase["k"] = "K"
  myucase["l"] = "L"
  myucase["m"] = "M"
  myucase["n"] = "N"
  myucase["h"] = "H"
  myucase["w"] = "W"
  myucase["y"] = "Y"

  if digraph_en then
    mylcase["v"] = "t͟h"
    myucase["v"] = "T͟h"
    mylcase["c"] = "s͟h"
    myucase["c"] = "S͟h"
    mylcase["x"] = "k͟h"
    myucase["x"] = "K͟h"
    mylcase["g"] = "g͟h"
    myucase["g"] = "G͟h"
    mylcase["p"] = "d͟h"
    myucase["p"] = "D͟h"
    mylcase["P"] = "d͟͏̣h"
    myucase["P"] = "D͟͏̣h"
  end

  local text3 = ''
  local caps = false
  local prev_charv = ''
  for index3 = 1, #text2 do
    local charv = text2:sub(index3, index3)
    if charv == "#" then
      caps = true
    else
      if caps then
        if myucase[charv] == nil then
          text3 = text3 .. charv
        else
          text3 = text3 .. myucase[charv]
        end
        caps = false
      else
        if digraph_en and charv == 'h' and prev_charv ~= '=' and (prev_charv == 't' or prev_charv == 's' or prev_charv == 'k' or prev_charv == 'd' or prev_charv == 'p' or prev_charv == 'P' or prev_charv == 'D' or prev_charv == 'c' or prev_charv == 'v' or prev_charv == 'x' or prev_charv == 'g') then
          text3 = text3 .. "·"
        end
        if mylcase[charv] == nil then
          text3 = text3 .. charv
        else
          text3 = text3 .. mylcase[charv]
        end
      end
    end
    prev_charv = charv
  end
  return text3
end

-- Romanize the text of each top-level Str inline in the span;
-- other inlines (spaces, etc.) are passed through unchanged.
function Romanize (elem)
  local content = elem.content
  for index, inline in ipairs(content) do
    if inline.t == 'Str' then
      content[index] = pandoc.Str(RomanizeMapping(inline.text))
    end
  end
  return content
end

function Span (elem)
  if elem.classes[1] == 'trn' then
    -- romanized and italicized
    return pandoc.Emph(Romanize(elem))
  elseif elem.classes[1] == 'trn2' then
    -- romanized, no italics
    return Romanize(elem)
  elseif elem.classes[1] == 'ar' then
    local attrs = pandoc.Attr("", {}, {{"lang", "ar"},{"dir","rtl"}})
    return pandoc.Span(elem.content, attrs)
  else
    return elem
  end
end

Now we can give the following input:

[Ealeilmu fi -SSigari kannaqci fi -lHajar]{.trn}

and it will be rendered thus:

ʾalɛilmu fi -ṣṣiġari kannaqši fi -lḥajar
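
As a quick sanity check outside Pandoc, the substitution step can be tried in a plain Lua interpreter. The sketch below uses a cut-down copy of the lowercase mapping holding only the keys this sentence needs; the function name romanize_ascii is an assumption made for illustration and is not part of the filter.

```lua
-- Cut-down copy of the lowercase mapping: only the keys used in the example.
local lcase = { E = "ʾ", e = "ɛ", S = "ṣ", g = "ġ", c = "š", H = "ḥ" }

-- Hypothetical helper: replace each mapped ASCII character, keep the rest.
-- Byte-wise iteration is safe here because every key is a single ASCII byte.
function romanize_ascii(input)
  local out = {}
  for i = 1, #input do
    local ch = input:sub(i, i)
    out[#out + 1] = lcase[ch] or ch
  end
  return table.concat(out)
end

print(romanize_ascii("Ealeilmu fi -SSigari kannaqci fi -lHajar"))
-- prints: ʾalɛilmu fi -ṣṣiġari kannaqši fi -lḥajar
```

Characters without an entry in the table, including spaces, hyphens and plain Latin letters, pass through unchanged, which is why only the letters needing diacritics have to be listed.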

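Note that the code has a variable digraph_en = false. Setting it to true switches the following letters from single characters to underlined digraphs:

| Arabic | Latin (digraph_en=false) | Latin (digraph_en=true) |
|--------|--------------------------|-------------------------|
| ث | ṯ | t͟h |
| خ | ḵ | k͟h |
| ذ | ḏ | d͟h |
| ش | š | s͟h |
| ظ | ḏ̣ | d͟͏̣h |
| غ | ġ | g͟h |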

If digraph_en = true is chosen, the filter automatically inserts a middot character (·) before h wherever it is needed to remove ambiguity.

For example, the Arabic text:

يَسْهُلُ، يَتْرُكْهُ، مَشْهَد، مَذْهَب، يُبْغِضْهُ، مَظْهَر، يَبْعَثْهُ، يُؤَرِّخْهُ، يُبْعِدْهُ، يُبْلِغْهُ

is entered for romanization as:

[yashulu, yatrukhu, machad, maphab, yubgiDhu, maPhar, yabeavhu, yuEarrixhu, yubeidhu, yublighu]{.trn}

and is output as:

digraph_en = true: yas·hulu, yatruk·hu, mas͟h·had, mad͟h·hab, yubg͟hiḍ·hu, mad͟͏̣h·har, yabɛat͟h·hu, yuʾarrik͟h·hu, yubɛid·hu, yublig͟h·hu

digraph_en = false: yashulu, yatrukhu, mašhad, maḏhab, yubġiḍhu, maḏ̣har, yabɛaṯhu, yuʾarriḵhu, yubɛidhu, yubliġhu
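
The separator rule on its own can be isolated as follows. This is a minimal sketch in plain Lua, assuming the same list of preceding input letters that the filter's condition uses; the name insert_middots is hypothetical, and the letter substitutions themselves are left out so that only the middot step is visible.

```lua
-- Input letters after which a following 'h' would be ambiguous
-- (the same letters the filter tests for when digraph_en is true).
local before_h = {}
for ch in ("tskdpPDcvxg"):gmatch(".") do
  before_h[ch] = true
end

-- Hypothetical helper: insert a middot before 'h' when the previous
-- input character is one of the letters above. No other substitution is done.
function insert_middots(input)
  local out, prev = {}, ""
  for i = 1, #input do
    local ch = input:sub(i, i)
    if ch == "h" and before_h[prev] then
      out[#out + 1] = "·"
    end
    out[#out + 1] = ch
    prev = ch
  end
  return table.concat(out)
end

print(insert_middots("yashulu"))   -- prints: yas·hulu
print(insert_middots("yatrukhu"))  -- prints: yatruk·hu
```

Because the test looks at the input letter rather than the rendered output, the rule also covers the single-key inputs such as c and x, whose rendered digraphs end in h.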

We also have the option to romanize text without putting it in italics (by using .trn2), and to produce uppercase characters (by typing # before the character). This can be useful for proper nouns:

Input:

[#zayd ibn E#abI #eamr]{.trn2}

Output:

Zayd ibn ʾAbī Ɛamr
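
The handling of # can be pictured as a one-character flag. Below is a minimal sketch in plain Lua, using cut-down copies of the two mapping tables with only the keys this example needs; the function name romanize_caps is an assumption for illustration and does not appear in the filter.

```lua
-- Cut-down copies of the mappings: only the keys used in the example.
local lcase = { E = "ʾ", I = "ī" }
local ucase = { z = "Z", a = "A", e = "Ɛ" }

-- Hypothetical helper: '#' is consumed and makes the next character
-- use the uppercase table; everything else uses the lowercase table.
function romanize_caps(input)
  local out, caps = {}, false
  for i = 1, #input do
    local ch = input:sub(i, i)
    if ch == "#" then
      caps = true
    elseif caps then
      out[#out + 1] = ucase[ch] or ch
      caps = false
    else
      out[#out + 1] = lcase[ch] or ch
    end
  end
  return table.concat(out)
end

print(romanize_caps("#zayd ibn E#abI #eamr"))
-- prints: Zayd ibn ʾAbī Ɛamr
```

Note that in E#abI the hamza comes first and the # applies to the vowel after it, since the hamza sign has no uppercase form.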

5 More details

For more details you may browse the code here: https://github.com/adamiturabi/rmd-arabic-romaniz