grafter.tabular

Functions for processing tabular data.

_

An alias for the identity function, used for providing positional arguments to mapc.

add-column

(add-column dataset new-column value)
Add a new column to a dataset with the supplied value lazily copied
into every row within it.

add-columns

(add-columns dataset hash)
(add-columns dataset source-cols f)
(add-columns dataset new-col-ids source-cols f)
Add several new columns to a dataset at once.  There are a number of different parameterisations:

(add-columns ds {:foo 10 :bar 20})

Calling with two arguments where the second argument is a hash map
creates a new column in the dataset for each of the hash map's keys
and copies the hash map's values lazily down all the rows.  This
parameterisation is designed to work well with build-lookup-table.

When given either a single column id or many along with a function
which returns a hashmap, add-columns will pass each cell from the
specified columns into the given function, and then associate its
returned map back into the dataset.  e.g.

(add-columns ds "a" (fn [a] {:b (inc a) :c (inc a)} ))

; =>

| a | :b | :c |
|---+----+----|
| 0 |  1 |  1 |
| 1 |  2 |  2 |

As a dataset needs to know its columns, in this case it will infer
them from the return value of the first row.  If you don't want to
infer them from the first row, you can also supply them like so:

(add-columns ds [:b :c] "a" (fn [a] {:b (inc a) :c (inc a)} ))

; =>

| a | :b | :c |
|---+----+----|
| 0 |  1 |  1 |
| 1 |  2 |  2 |
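
The map-returning parameterisation can be pictured in plain Clojure.
This is only a sketch under assumptions: a dataset is modelled as a
sequence of row maps, and sketch-add-columns is a hypothetical helper,
not grafter's implementation.

```clojure
;; Assumption: a "dataset" is just a seq of row maps here,
;; not a grafter/incanter Dataset.
(defn sketch-add-columns
  "Calls f on the cell in source-col and merges the returned map into each row."
  [rows source-col f]
  (map (fn [row] (merge row (f (get row source-col)))) rows))

(def example-rows [{"a" 0} {"a" 1}])

(sketch-add-columns example-rows "a" (fn [a] {:b (inc a) :c (inc a)}))
;; => ({"a" 0, :b 1, :c 1} {"a" 1, :b 2, :c 2})
```

Because map is lazy, the new cells are only computed as rows are
consumed, which mirrors the lazy copying described above.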

all-columns

(all-columns dataset cols)
Takes a dataset and any number of integers corresponding to column
numbers and returns a dataset containing only those columns.

If you want to use infinite sequences of columns or allow the
specification of more cols than are in the data without error you
should use columns instead.  Using an infinite sequence with this
function will result in non-termination.

One advantage of this over using columns is that you can duplicate
an arbitrary number of columns.

apply-columns

(apply-columns dataset fs)
Like mapc in that you associate functions with particular columns,
though it differs in that the functions given to mapc should receive
and return values for individual cells.

With apply-columns, the function receives a collection of cell
values from the column and should return a collection of values for
the column.

It is also possible to create new columns with apply-columns.  For
example, to assign row ids you can do:

(apply-columns ds {:row-id (fn [_] (grafter.sequences/integers-from 0))})
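
The difference between cell-wise and column-wise functions may be
easier to see in plain Clojure.  The sketch below is illustrative
only: rows are modelled as maps, fs is assumed to be a hash map of
column key to function, and sketch-mapc and sketch-apply-columns are
hypothetical names, not the library's code.

```clojure
;; Assumption: a "dataset" is a seq of row maps; fs is {column-key fn}.

(defn sketch-mapc
  "Cell-wise: each function receives one cell and returns one cell."
  [rows fs]
  (map (fn [row]
         (reduce-kv (fn [r col f] (assoc r col (f (get r col)))) row fs))
       rows))

(defn sketch-apply-columns
  "Column-wise: each function receives all cells of a column and
  returns a (possibly infinite) seq of replacement cells."
  [rows fs]
  (reduce-kv (fn [rs col f]
               (map (fn [row v] (assoc row col v))
                    rs
                    (f (map #(get % col) rs))))
             rows
             fs))

(sketch-mapc [{:a 0 :b "x"} {:a 1 :b "y"}] {:a inc})
;; => ({:a 1, :b "x"} {:a 2, :b "y"})

(sketch-apply-columns [{:a 10} {:a 20}] {:row-id (fn [_] (iterate inc 0))})
;; => ({:a 10, :row-id 0} {:a 20, :row-id 1})
```

In the second call the function ignores its argument and returns an
infinite sequence; pairing it with the rows stops when the rows run out.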

build-lookup-table

(build-lookup-table dataset key-cols)
(build-lookup-table dataset key-cols return-keys)
Takes a dataset, a vector of any number of column names corresponding
to key columns, and a column name corresponding to the value
column.
Returns a function that takes a vector of keys as its
argument and returns the corresponding value.
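
As a rough picture of the idea, a lookup function can be built from
row maps in plain Clojure.  This sketch makes assumptions: rows are
maps, a single value column is used, and sketch-lookup-table is a
hypothetical name rather than the library's implementation.

```clojure
;; Assumption: a "dataset" is a seq of row maps.
(defn sketch-lookup-table
  "Returns a function from a vector of key values to the value-col cell."
  [rows key-cols value-col]
  (let [table (into {}
                    (map (fn [row] [(mapv row key-cols) (get row value-col)]))
                    rows)]
    (fn [ks] (get table ks))))

(def lookup
  (sketch-lookup-table [{:code "a" :year 2001 :name "Alpha"}
                        {:code "b" :year 2002 :name "Beta"}]
                       [:code :year]
                       :name))

(lookup ["a" 2001]) ;; => "Alpha"
(lookup ["z" 1999]) ;; => nil
```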

column-names

If given a dataset, it returns its column names. If given a dataset and a sequence
of column names, it returns a dataset with the given column names.

columns

(columns dataset cols)
Given a dataset and some columns, narrow the dataset to just the
supplied columns.

cols are paired off with columns in the data and then a selection is
done.  Any cols left over after the pairing are discarded, but if a
selected col is not actually in the data an IndexOutOfBoundsException will
be thrown.

This function can safely be used with infinite sequences.
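
The pairing step is what makes infinite sequences safe.  A minimal
plain-Clojure sketch, assuming rows are vectors and cols are integer
positions (sketch-columns is a hypothetical name, not grafter's code):

```clojure
;; Assumption: each row is a vector of cells; cols are integer indexes.
(defn sketch-columns
  "Keeps only as many cols as the data is wide, then selects them."
  [rows cols]
  (let [width    (count (first rows))
        selected (take width cols)] ; cols left over after pairing are dropped
    (map (fn [row] (mapv #(nth row %) selected)) rows)))

(def example-data [[:a0 :b0 :c0] [:a1 :b1 :c1]])

(sketch-columns example-data (range)) ; infinite seq of column numbers
;; => ([:a0 :b0 :c0] [:a1 :b1 :c1])

(sketch-columns example-data [2 0])
;; => ([:c0 :a0] [:c1 :a1])
```

In this sketch, asking for a position that is not in the data, such
as 5, makes nth throw an IndexOutOfBoundsException once the row is
realised.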

dataset?

(dataset? ds)
Predicate function to test whether the supplied argument is a
dataset or not.

defgraft

macro

(defgraft name docstring? tabular->graph-fn)
(defgraft name docstring? pipeline template quad-fn*)
Declares an entry point to a graph-generating pipeline allowing it
to be exposed to the Grafter import service and executed via the
leiningen plugin.

It is effectively equivalent to the following call with additional
metadata benefits:

(def my-graft (comp make-graph my-pipeline))

It is used with defpipeline to indicate that a transformation also
supports conversion into graph data.

It takes an optional docstring; if none is specified, a default
docstring will be generated.

defpipe

macro

(defpipe & args)
Declares an entry point to a grafter pipeline, allowing it to be
exposed to the Grafter import service and executed via the leiningen
plugin.

It has the same form as "defn" but adds metadata to the defined
var that lets pipelines be discovered at runtime through both
syntactic and metadata means.

derive-column

(derive-column dataset new-column-name from-cols)
(derive-column dataset new-column-name from-cols f)
Adds a new column to the end of each row, derived from the column or
columns specified by from-cols.  f should just return the cell's value.

If no f is supplied the identity function is used, which results in
the specified column being cloned.

drop-rows

(drop-rows dataset n)
Drops the first n rows from the dataset, retaining the rest.

graph-fn

macro

(graph-fn [row-bindings] & forms)
A macro that defines an anonymous function to convert a tabular
dataset into a graph of RDF quads.  Ultimately it converts a
lazy-seq of rows inside a dataset, into a lazy-seq of RDF
Statements.

The function body should be composed of any number of forms, each of
which should return a sequence of RDF quads.  These will then be
concatenated together into a flattened lazy-seq of RDF statements.

Rows are passed to the function one at a time as hash-maps, which
can be destructured via Clojure's standard destructuring syntax.

Additionally, destructuring can be done on row indices (when a
vector form is supplied) or column names (when a hash-map form is
supplied).

grep

multimethod

Filters rows in the table for matches.  This multimethod
dispatches on the type of its second argument.  It also takes any
number of column numbers as the final set of arguments.  These
narrow the scope of the grep to only those columns.  If no columns
are specified then grep operates on all columns.

make-dataset

(make-dataset)
(make-dataset data)
(make-dataset data columns-or-f)
Like incanter's dataset function except it can take a lazy-sequence
of column names which will get mapped to the source data.

Works by inspecting the number of columns in the first row, and
taking that many column names from the sequence.

Inspects the first row of data to determine the number of columns,
and creates an incanter dataset with columns named alphabetically as
by grafter.sequences/column-names-seq.

mapc

(mapc dataset fs)
Takes a vector or a hashmap of functions and maps each to the key
column for every row.  Each function should be from a cell to a
cell, whereas with apply-columns it should be from a column to a
column, i.e. a function from a collection of cells to a collection
of cells.

If the specified column does not exist in the source data a new
column will be created, though the supplied function will need to
either ignore its argument or handle a nil argument.

melt

(melt dataset & pivot-keys)
Melt an object into a form suitable for easy casting, like a melt function in R.
It accepts multiple pivot keys (identifier variables that are reproduced for each
row in the output).

(use '(incanter core charts datasets))
(view (with-data (melt (get-dataset :flow-meter) :Subject)
        (line-chart :Subject :value :group-by :variable :legend true)))

See http://www.statmethods.net/management/reshape.html for more examples.
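
The reshaping itself can be sketched in plain Clojure.  The following
is an illustration under assumptions, not the library's code: rows are
maps, the output uses the :variable and :value column names seen in
the chart call above, and sketch-melt is a hypothetical name.  The
order of output rows may differ from the real function.

```clojure
;; Assumption: a "dataset" is a seq of row maps.
(defn sketch-melt
  "Emits one output row per non-pivot cell, repeating the pivot keys."
  [rows & pivot-keys]
  (for [row   rows
        [k v] (apply dissoc row pivot-keys)]
    (assoc (select-keys row pivot-keys) :variable k :value v)))

(def melted
  (sketch-melt [{:Subject 1 :before 10 :after 12}
                {:Subject 2 :before 20 :after 25}]
               :Subject))

(count melted) ;; => 4
(first (filter #(and (= 1 (:Subject %)) (= :after (:variable %))) melted))
;; => {:Subject 1, :variable :after, :value 12}
```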

move-first-row-to-header

(move-first-row-to-header [first-row & other-rows])
For use with make-dataset.  Moves the first row of data into the
header, removing it from the source data.

read-dataset

(read-dataset datasetable & {:keys [format], :as opts})
Opens a dataset from a datasetable thing, i.e. a filename or an existing Dataset.
The multimethod dispatches on the :format option.  If this isn't provided then
the type is used.  If that isn't available either, it falls back to the file extension.

Options are:

  :format - to force the datasetable to be opened with a particular method.

read-datasets

(read-datasets dataset & {:keys [format], :as opts})
Opens a lazy sequence of datasets from something that returns multiple
datasetables, e.g. all the worksheets in an Excel workbook.

rename-columns

(rename-columns dataset col-map-or-fn)
Renames the columns in the dataset.  Takes either a map or a
function.  If a map is passed it will rename the specified keys to
the corresponding values.

If a function is supplied it will apply the function to all of the
column-names in the supplied dataset.  The return values of this
function will then become the new column names in the dataset
returned by rename-columns.
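
Both forms can be pictured with a small plain-Clojure sketch.  It
assumes rows are maps keyed by column name, and sketch-rename-columns
is a hypothetical helper rather than the library's implementation.

```clojure
(require '[clojure.set :as set])

;; Assumption: a "dataset" is a seq of row maps keyed by column name.
(defn sketch-rename-columns
  "Renames row keys using either a map of old->new names or a function."
  [rows col-map-or-fn]
  (let [old-names (keys (first rows))
        ;; maps are also functions in Clojure, so test for a map first
        renames   (if (map? col-map-or-fn)
                    col-map-or-fn
                    (zipmap old-names (map col-map-or-fn old-names)))]
    (map #(set/rename-keys % renames) rows)))

(sketch-rename-columns [{"a" 1 "b" 2}] {"a" :id})
;; => ({:id 1, "b" 2})

(sketch-rename-columns [{"a" 1 "b" 2}] keyword)
;; => ({:a 1, :b 2})
```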

resolve-column-id

(resolve-column-id dataset column-key)
(resolve-column-id dataset column-key not-found)
Finds and resolves the column id by converting between symbols and
strings.  If column-key is not found in the dataset's headers then
not-found is returned.

resolve-key-cols

(resolve-key-cols dataset key-cols)
FIXME: write docs

rows

(rows dataset row-numbers & {:as opts})
Takes a dataset and a seq of row-numbers and returns a dataset
consisting of just the supplied rows.  If a row number is not found
the function will assume it has consumed all the rows and return
normally.

swap

(swap dataset first-col second-col)
(swap dataset first-col second-col & more)
Takes an even number of column names and swaps each pair of columns.

take-rows

(take-rows dataset n)
Takes only the first n rows from the dataset, discarding the rest.

test-dataset

(test-dataset r c)
Constructs a test dataset of r rows by c cols e.g.

(test-dataset 2 2) ;; =>

| A | B |
|---+---|
| 0 | 0 |
| 1 | 1 |

with-metadata-columns

(with-metadata-columns [context data])
Takes a pair of [context, data] and returns a dataset in which the
metadata context is merged into the dataset itself.

without-metadata-columns

(without-metadata-columns [context data])
Ignores any possible metadata and leaves the dataset as is.

write-dataset

(write-dataset destination dataset & {:keys [format], :as opts})
FIXME: write docs