вторник, 12 апреля 2016 г.

Text format to columns on Haskell

This is the language study excersice which reformat text input into text with columns. All process will have follow parameters:

  • ncols - number of columns
  • nspaces - spaces number between columns
  • width - width of columns
  • pagesize - lines of page

We will use next modules and exported names:

import           Data.List
import           Data.Maybe            (fromMaybe, isJust)
import           Prelude               hiding (Word)
import           System.Console.GetOpt
import           System.Environment

Parameters of re-formating will be got from command line, which is parsing with standard Haskell module GetOpt (see above). Btw, this parser (be default?) supports "-aN" but not "-a N".

This is the "record" for CLI options:

-- | Stricts to get syntax errors immediately
data CmdOpts = CmdOpts {
  help     :: !Bool,
  ncols    :: !Int, -- number of columns
  nspaces  :: !Int, -- spaces between columns
  width    :: !Int, -- width of column
  pagesize :: !Int -- lines in page
  } deriving Show

Their fields are strict to get syntax error immediately. We are reading input text from "stdin" so will we are waiting for input (WITH OPTION ERROR!) we won't get error reporting - so, they will be strict.

This null-arity function returns the default values of CLI options (if some are omitted):

-- | Default values of command line options
defaultCmdOpts :: CmdOpts
defaultCmdOpts = CmdOpts {
  help = False,
  ncols = 2,
  nspaces = 4,
  width = 20,
  pagesize = 24
  }

Then, as in other languages (argparse in Python ,for example) we should define options: their names, types, numbers, mandatory, etc.:

-- | Definition of program command line options
cmdSyntax :: [OptDescr (CmdOpts -> CmdOpts)]
cmdSyntax =
  [
    Option ['h', '?'] ["help"]
      (NoArg (\cmdOpts -> cmdOpts {help=True} ))
      "print this help",

    Option ['c'] ["cols"]
      (OptArg (\s cmdOpts -> cmdOpts {ncols=readOpt s}) "NCOLS")
      "columns number (default: 2)",

    Option ['s'] ["spaces"]
      (OptArg (\s cmdOpts -> cmdOpts {nspaces=readOpt s}) "NSPACES")
      "spaces number (default: 4)",

    Option ['w'] ["width"]
      (OptArg (\s cmdOpts -> cmdOpts {width=readOpt s}) "WIDTH")
      "column width (default: 20)",

    Option ['p'] ["pagesize"]
      (OptArg (\s cmdOpts -> cmdOpts {pagesize=readOpt s}) "PAGESIZE")
      "page size in lines (default: 25)"
  ] where
  readOpt :: Read a => Maybe [Char] -> a
  readOpt s = (read $ fromMaybe "" s)

First arg of OptArg is lambda which gets string value of option and aggregated options values - it must parse string to internal type and to save result into this cmdOpts for further processing.

Parsing function is parseCmdOpts which works with external world so operates in IOcontext:

-- | Usage string
usage :: String
usage = usageInfo "SYNTAX: col [options...]" cmdSyntax


-- | Parses command line options
parseCmdOpts :: IO CmdOpts
parseCmdOpts = do
  argv <- getArgs
  case getOpt Permute cmdSyntax argv of
    (opts, _, []) -> return $ foldl (flip id) defaultCmdOpts opts
    (_, _, errs) -> error $ concat errs ++ "\n" ++ usage

This is the utility function which split sequence into pies with size n and is based on standard function splitAt:

-- | Splits `xs` on pies with the same size `n`
splitEvery :: Int -> [a] -> [[a]]
splitEvery _ [] = []
splitEvery n xs = as : splitEvery n bs
  where (as, bs) = splitAt n xs

To simplify our types expressions (signatures) we set types definitions for word (symbols without spaces), line, column and page:

-- | Typedefs
type Word = String
type Line = String
type Column = [Line]
type Page = [Column]

Here we begin actual processing. First, we need to make main working function which gets words and cuts (extracts) line from them but saves rest of words for further processing. It calculate how many words we need, then calculate how much they must be sparse (to fill width of column) and returns result as Line and list of words rest. If it's impossible to extract line with such width then result will be Nothing (if possible - line is wrapped into Just). Idea is to calculate lengths of words but (to simplify further calculations) additively: not [2, 3, 4] but [2, 6, 10], i.e. lengths of words plus one-symbol spaces (minimal needed symbols). Then we takes as much words as we need to keep line length <= width. This does wl = ... line in the code:

-- | Cuts line with width `w` from words `ws` by drop some words and adding
-- spaces between kept ones. If it's possible returns (Maybe line, rest-of-words),
-- otherwise (Nothing, all-words)
cutLine :: Int -> [Word] -> (Maybe Line, [Word])
cutLine _ [] = (Just "", [])
cutLine w ws =
  let wl = takeWhile (<=w) $ scanl1 ((+).(+1)) $ map length ws -- additive words lengths
      wn = length wl -- words number
      (w0, w1) = splitAt wn ws
      gn = wn - 1 -- gaps number
  in
    case gn of
      -1 -> (Nothing, ws)
      0 -> (Just $ w0!!0, w1)
      _ ->
        let ag = w - last wl -- additional gaps needed
            gw = 1 + ag `div` gn -- gaps width (except last one)
            lgw = gw + (ag `mod` gn) -- last gap width
            gap = replicate gw ' '
            lgap = replicate lgw ' '
            (firstWords, lastWord) = splitAt (wn-1) w0
        in
          (Just $ intercalate gap firstWords ++ lgap ++ lastWord!!0, w1)

We calculates needed spaces to sparse line to width w and add them into each gap, last gap will have remainder of division (if we are adding 5 spaces to 2 gapes then last will have 3 spaces, not 2).

Next function cuts column from words: actually we can think about resulting text as text with one column looking like snake :) So, we cut one column (all columns have the same width, laid snake column):

-- | Cuts column (list of lines) with width `w` from words `ws`. If it's possible
-- (to reformat words into new width `w`), returns Maybe column, Nothing otherwise
cutColumn :: Int -> [Word] -> Maybe Column
cutColumn w ws =
  let (mbLine, othWords) = cutLine w ws in
    do
      line <- mbLine
      case othWords of
        [] -> return [line]
        _ -> do othLines <- (cutColumn w othWords)
                return $ [line] ++ othLines

Now we will cut the "snake" into separate pies - real columns. We need (sure) to know page size (ps argument) and width of columns (w). We use >>= bind function to drop Nothing (into Nothing). And yes, we actually cut the snake - with splitEvery - into the same size parts (size is ps):

-- | Cuts columns with width `w` for page with size `ps` (lines in page) from input
-- text `txt`
cutColumns :: Int -> Int -> String -> Maybe [Column]
cutColumns w ps txt =
  let ws = words txt in
    cutColumn w ws >>= \c -> return $ splitEvery ps c

And last we cuts pages from list of columns, principle is the same:

-- | Cuts pages from early cutted (reformatted) columns `mbCols` when number of columns
-- is `nc`
cutPages :: Int -> Maybe [Column] -> Maybe [Page]
cutPages nc mbCols = mbCols >>= \cols -> return $ splitEvery nc cols

Now we need utility function, zip - not for 2 sequences, but for any number of them. enumFrom is analogue of Python range(0), we select each time item with the same index - like in real zip. But since we have any number of sequence, they are represented as list of sequences seqs. Sequences can have different number of items, so we complement each one with Nothing (NULL), but then existing items will be wrapped in Just. We stop to consume resulting sub-lists when all their items will be Nothing:

-- | zip for N sequences
zipN :: [[a]] -> [[Maybe a]]
zipN seqs =
  let seqs' = map (\sq -> map Just sq ++ repeat Nothing) seqs in
  takeWhile (any isJust) [[sq!!i | sq <- seqs'] | i <- enumFrom 0]

See picture:

  1     2     3     N     M        <- seq1
  2     3     N     N     N        <- seq2
[1,2] [2,3] [3,N] skip! skip! ...  <- zipN

Now we should "fold" results into text (from structured data). Folding of page will creates page into ONE column (the snake, yes:)

-- | Folds page `page` with width of column `w` and vertical space `vSpace` into string
foldPage :: String -> Int -> Page -> Column
foldPage vSpace w page =
 let emptyCol = replicate w ' ' in
 map (\c -> intercalate vSpace $ map (fromMaybe emptyCol) c) (zipN page)

Then we can fold pages into text (String). "Pages" - because result of cut is list of pages (which are columns which are lines which are words). Each page will be delimited with string like (page number):


                                 - 2 -

named colontit. It is justified with pw - page width.

vSpace here is vertical space between columns. enumPage is enumerated pages with... zip, zip does it in very natural way - it "adds" to page it's number (from 1, see enumFrom 1):

-- | Folds pages `pages` with columns of width `w` and page width `pw` into string
foldPages :: String -> Int -> Int -> [Page] -> String
foldPages vSpace w pw pages =
  let showPage i = "- " ++ show i ++ " -"
      pdel = "\n\n"
      colontit i = pdel
                   ++ replicate (pw `div` 2 - length (showPage i) `div` 2) ' '
                   ++ (showPage i) ++ pdel
      enumPages = zip (map (intercalate "\n" . foldPage vSpace w) pages) (enumFrom 1)
  in
    intercalate "" $ map (\(p,i) -> p ++ colontit (i::Integer)) enumPages

foldPage folds pages into one column (list of lines) and then we are joing column lines with "\n". Last function is col which joins together all processing:

-- | Reformats input text `txt` into new text with columns which options are setted
-- in `opts`
col :: CmdOpts -> String -> String
col opts txt =
  let mbCols = cutColumns (width opts) (pagesize opts) txt
      mbPages = cutPages (ncols opts) mbCols
      vSpace = replicate (nspaces opts) ' '
      pageWidth = (ncols opts) * (width opts) + (ncols opts - 1) * (nspaces opts)
  in
  foldPages vSpace (width opts) pageWidth (fromMaybe [] mbPages)

pageWidth we need to justify colontit.

Main entry point is simple and runs all I/O: processing of "stdin" content (see getContents) with out col function. But if we have fHelp then we print usage help instead (user enters "-h" CLI option):

main :: IO()
main = do
  cmdOpts <- parseCmdOpts
  let CmdOpts { help = fHelp } = cmdOpts in
    if fHelp then putStrLn usage
    else do
         inTxt <- getContents
         putStr $ col cmdOpts inTxt

fHelp is boolean flag mathed help field of our CLI options record. Actually it's simple example, but looks interesting. To run it use:

cat somefile.txt | runghc col.hs -w40 | less

This code is available in my Git