Integrated Custom File Processors
The features in this article were added in Data Import 1.2. They enhance and extend prior versions' Integrated Custom Record Processing and Integrated Custom File Generation features.
Overview
Data Import is not a catch-all extract-transform-load (ETL) tool of the kind found among commercial applications. Boomi, Microsoft BizTalk, Oracle WebLogic, and Talend are examples of ETL applications (ranging from open source to on-premises to SaaS) that may be used to load legacy data sources into the Ed-Fi ODS / API.
Data Import is a data mapping and data loading tool for the ODS / API. CSV fields are mapped to ODS / API fields, so that a CSV input file can be iterated over and POST-ed to the ODS / API resource endpoints. Still, there are some ETL-style needs which present as obstacles to simple mapping. Data Import includes PowerShell preprocessor support in service of these needs.
Before opting into this feature, consider: is your scenario an aspect of data mapping itself, or an ETL process better served prior to mapping?
Examples of mapping challenges served by PowerShell preprocessors:
- Simple 'clean up' of individual values such as trimming excess space or reformatting dates (see the sketch after this list).
- Importing from non-CSV files (tab-delimited, fixed-width, ...).
- Reshaping of the row 'grain' of the input file (one row becomes multiple rows or vice-versa).
- Skipping input rows (e.g. student assessment rows that indicate the student was not tested).
- Performing API calls against the target ODS / API.
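To make the first of these concrete, here is a minimal sketch of the kind of per-line clean-up a preprocessor can perform. The Begin/Process/End structure it uses is explained later in this article, and the column position and date format are assumptions for illustration only:

[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
Process {
    # Naive comma split and trim of each value (assumes no quoted commas in the data)
    $values = @($line.Split(',') | ForEach-Object { $_.Trim() })

    # Reformat an assumed "M/d/yyyy" date in the third column to "yyyy-MM-dd"
    if ($values.Length -gt 2) {
        $parsed = [datetime]::MinValue
        if ([datetime]::TryParseExact($values[2], 'M/d/yyyy', [System.Globalization.CultureInfo]::InvariantCulture, [System.Globalization.DateTimeStyles]::None, [ref]$parsed)) {
            $values[2] = $parsed.ToString('yyyy-MM-dd')
        }
    }

    Write-Output ($values -join ',')
}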
Three Kinds of Preprocessor Scripts
Prior versions of Data Import introduced support for two kinds of PowerShell preprocessor scripts:
- Custom Row Processors
- Optionally selected on the Agent screen.
- The script is invoked once per input row, with the opportunity to inspect and modify each row one at a time.
- Because they are an aspect of your Agent rather than of a Data Map, these scripts are not shared.
- As a general rule, we recommend upgrading these to work with the new "Custom File Processor" scripts described below, for additional flexibility, sharing, and diagnostic logging.
- Custom File Generators
- Optionally selected as the type of Agent on the Agent screen. Rather than working with a preexisting file, the Agent produces its own file with a script and then proceeds with mapping/importing.
- These tend to require exceptional access to the server's capabilities in order to perform database queries and to access the filesystem directly.
- Using a File Generator is a strong indicator that you are mixing Data Import's data mapping role with the ETL role of other systems. Consider: does Data Import need to know about your bespoke ETL scripting prior to mapping and importing?
Data Import 1.2 adds a third type of preprocessor script:
- Custom File Processors
- Optionally selected on the Data Map screen.
- As they are an aspect of a Data Map, these are naturally shared with Data Maps via Import / Export and the Template Sharing Service.
- Can perform API calls against the ODS / API during data load. Data Import manages your Key and Secret, as well as Access Tokens, to ease access to the API.
- Enhanced logging support: scripts can log diagnostic information during execution, useful when debugging scripts (see the sketch after this list).
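As a quick illustration of this structure and the logging support, here is a minimal sketch of a Custom File Processor that skips unwanted rows while logging a count. The column position and 'NOT-TESTED' marker are assumptions for the example; the Begin/Process/End structure is explained in the examples below:

[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
Begin {
    $skipped = 0
}
Process {
    # Assume the fifth column carries a test-status flag; skip rows marked NOT-TESTED
    $values = $line.Split(',')
    if ($values.Length -gt 4 -and $values[4] -eq 'NOT-TESTED') {
        $skipped++
        return
    }
    Write-Output $line
}
End {
    Write-Information "Skipped $skipped untested rows"
}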
PowerShell "Sandboxing" Security
Data Import is not a programming language. Data Import is not an ETL tool. Data Import is a data mapping tool which posts to the ODS / API. As of Data Import 1.2, preprocessor scripts run in a safe-by-default PowerShell "sandbox". The core language is available, as well as many common built-in operations and ODS / API specific commands, but all potentially-dangerous operations are locked down by default. A malicious script author, for instance, might try to cause harm by accessing the filesystem directly, by attempting to connect to databases on the local network, and the like. Because the sandbox is safe by default, all such attempts would rightly fail at runtime.
This sandboxing safety can naturally pose a problem when your script needs to do something like access the filesystem or query databases. File Generators, for instance, perform exactly that kind of work by their very nature. This is a good reason to question whether such an operation belongs inside a data mapping step within a data mapping tool, rather than inside an ETL process prior to mapping with Data Import. If an administrator wants to opt in to the risks of arbitrary code running in their data maps, performing file access, database connections, and the like, they can enable full access to the PowerShell language and commands: in both the DataImport.Web/Web.config file and the DataImport.Server.TransformLoad/DataImport.Server.TransformLoad.exe.config file, locate the <appSettings> tag and add the following setting to enable full PowerShell access:
<add key="UsePowerShellWithNoRestrictions" value="True" />
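To illustrate the difference, a hypothetical fragment like the following would be expected to fail at runtime under the default sandbox, and only becomes viable once unrestricted access has been enabled (the path and connection string are placeholders):

Process {
    # Direct filesystem access: blocked by the default sandbox
    $lookup = Get-Content 'C:\lookups\student-ids.csv'

    # Direct database access: likewise blocked by default
    $connection = New-Object System.Data.SqlClient.SqlConnection 'Server=localhost;Database=Staging;Integrated Security=true'
    $connection.Open()
}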
Custom File Processors - Examples
The rest of this page focuses on modern Custom File Processors. For more on Row Processors and File Generators, see the documentation for those earlier features.
All of the examples in this page are based on the Student Assessments example described in the [Quick Start].
All of the example templates and example input files are in the attached zip. The *.json template files can all be imported into a test Data Import installation, and then tested using Manual Agents with the sample *.csv and *.txt files enclosed in the same zip.
They are all intended to work against an ODS populated with the "Grand Bend" sample data, such as the public instances hosted at https://api.ed-fi.org/.
After importing the sample *.json templates found in the attached zip file, you'll have several similar Data Maps, Bootstraps, and, most interestingly, Preprocessors.
Example: Tab-Delimited Files
Template to Import: ODS 5.2 Student Assessments - Tab Delimited.json
File to Upload: DI-QA-STUDENT-ASSESSMENTS-2018-GrandBend - Tab Delimited.txt
Let's start simple. We have a file where the individual values are not separated by commas, but instead by wide 'tab' characters. See for yourself: open the file in a simple text editor like Windows' Notepad and move along the contents of each line. You'll find that the gaps between values are not multiple spaces, but instead single wide tab characters.
By default, Data Import works with Comma-Separated Value (CSV) files, so on its own, this file will be meaningless to Data Import. However, the Data Map in this template includes its own Preprocessor to convert each line from being tab-delimited to a valid CSV:
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
Begin {
    Write-Information "Converting tabs to commas"
}
Process {
    Write-Output $line.Replace("`t", ",")
}
End {
    Write-Information "Finished converting tabs to commas"
}
Here, we see the basic structure of a 'Custom File Processor' script. Custom File Processors begin like so (it's best to copy this directly to avoid typos and get off to a good start!):
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
This opening set of lines indicates that we will be receiving and processing the original tab-delimited file one line at a time, and in the rest of our script we will be able to refer to the current line using the variable $line.
Next, we see the script is made of three blocks: Begin, Process, and End. As a PowerShell Advanced Function, the instructions in the Begin and End blocks each run once, at the start and end of the execution. The Process block will run many times, once per line (accessed via $line).
You can output diagnostic information with the built-in PowerShell commands Write-Information and Write-Host. You can report errors either by using throw or by using the built-in PowerShell command Write-Error. We'll see more on error handling in another example below.
The core logic in our script, then, is the single line within Process: take the original input file line, replace all tabs ("`t" in PowerShell terminology) with commas, and then output the result.
While the other Write-* commands output mere diagnostic information to the log at runtime, Write-Output is special. Anything we output with Write-Output indicates the new contents of the resulting CSV file to be processed, mapped, and POST-ed to the ODS / API.
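To make the distinction concrete, here is a hypothetical Process block, in the spirit of the tab-delimited example above, that writes to the diagnostic log for every line but only contributes non-blank lines to the generated CSV:

Process {
    Write-Information "Processing line: $line"    # diagnostic log only

    if ($line.Trim().Length -eq 0) {
        Write-Host "Skipping blank line"           # also diagnostic only
        return
    }

    Write-Output $line.Replace("`t", ",")          # becomes part of the resulting CSV
}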
Example: Fixed-Width Files
Template to Import: ODS 5.2 Student Assessments - Fixed Width.json
File to Upload: DI-QA-STUDENT-ASSESSMENTS-2018-GrandBend - Fixed Width.txt
In this example, we deal with a "Fixed Width Field" file. Although fixed-width files are strongly discouraged as difficult to work with reliably, some third-party systems export their data as files where each field is understood to exist at specific (fixed) character positions. For instance, such a file might be understood to have a field "Student Name" beginning at exactly the 134th character in a line and extending to exactly the 151st character of the line, and so on.
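Extracting such a field by hand means hard-coding character offsets, along the lines of this hypothetical fragment (Substring is zero-based, so the 134th character is index 133):

Process {
    # Manually slice an assumed "Student Name" field spanning characters 134 through 151
    $studentName = $line.Substring(133, 18).Trim()

    # ... and so on for every other field, each at its own hard-coded offset and width
}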
Open this sample *.txt file and contrast it with the tab-delimited file of the previous example. Note that this time there are no wide tab characters. Instead, every field is carefully spaced out with individual spaces to fit a particular predefined understanding of where each field begins and ends. The only way to make sense of this during import is to explicitly chop up each line based on the exact start position of each expected field. This can be quite painful to do oneself, so Data Import provides a convenient command ConvertFrom-FixedWidth