One of the most important OrientDB-ETL module features is the simplicity to configure complex ETL processed, just by working to a single JSON file.
The Configuration file is divided in the following sections:
{
"config": {
<name>: <value>
},
"begin": [
{ <block-name>: { <configuration> } }
],
"source" : {
{ <source-name>: { <configuration> } }
},
"extractor" : {
{ <extractor-name>: { <configuration> } }
},
"transformers" : [
{ <transformer-name>: { <configuration> } }
],
"loader" : { <loader-name>: { <configuration> } },
"end": [
{ <block-name>: { <configuration> } }
]
}
Example:
{
"config": {
"log": "debug",
"fileDirectory": "/temp/databases/dbpedia_csv/",
"fileName": "Person.csv.gz"
},
"begin": [
{ "let": { "name": "$filePath", "value": "$fileDirectory.append( $fileName )"} },
{ "let": { "name": "$className", "value": "$fileName.substring( 0, $fileName.indexOf(".") )"} }
],
"source" : {
"file": { "path": "$filePath", "lock" : true }
},
"extractor" : {
"row": {}
},
"transformers" : [
{ "csv": { "separator": ",", "nullValue": "NULL", "skipFrom": 1, "skipTo": 3 } },
{ "merge": { "joinFieldName":"URI", "lookup":"V.URI" } },
{ "vertex": { "class": "$className"} }
],
"loader" : {
"orientdb": {
"dbURL": "plocal:/temp/databases/dbpedia",
"dbUser": "admin",
"dbPassword": "admin",
"dbAutoCreate": true,
"tx": false,
"batchCommit": 1000,
"dbType": "graph",
"indexes": [{"class":"V", "fields":["URI:string"], "type":"UNIQUE" }]
}
}
}
$input is the context variable assigned before each transformation={<expression>}, example: ={eval('3 * 5')}All executable blocks, like Transformers and Blocks, can be executed only if a condition is true by using the if conditional expression using the OrientDB SQL syntax. Example:
{ "let": {
"name": "path",
"value": "C:/Temp",
"if": "${os.name} = 'Windows'"
}
},
{ "let": {
"name": "path",
"value": "/tmp",
"if": "${os.name}.indexOf('nux')"
}
}
`
Most of the blocks, like Transformers and Blocks, supports the log setting. Log can be one of the following values (case insensitive): [NONE, ERROR, INFO, DEBUG]. By default is INFO.
Set the log level to DEBUG to display more information on execution. Remember that logging slows down execution, so use it only for development and debug purpose. Example:
{ "http": {
"url": "http://ip.jsontest.com/",
"method": "GET",
"headers": {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
},
"log": "DEBUG"
}
}
All the variables declared in "config" block are bound in the execution context and can be used by ETL processing.
There are also special variables used by ETL process:
| Variable | Description | Type | Mandatory | Default value |
|---|---|---|---|---|
| log | Global "log" setting. Accepted values: [NONE, ERROR, INFO, DEBUG]. Useful to debug a ETL process or single component. |
string | false | INFO |
| maxRetries | Maximum number of retries in case the loader raises a ONeedRetryException: concurrent modification of the same records | integer | false | 10 |
| parallel | Executes pipelines in parallel by using all the available cores | boolean | false | false |