Feral File Handler: How to handle files that just don’t conform

Most ETL processes consume files that originate outside the process, and often outside the business. This externality introduces risk: we cannot manually monitor and quality-assure what is received, even at low volumes.

Here at KETL we refer to any of these external files as feral. We cannot guarantee they will comply with, or even ‘behave’ within, our process, and they can cause data errors further downstream before being caught. The impact includes, but is not limited to:

  • 1. Unrealistic reporting figures
  • 2. ETL processes failing to complete successfully due to data errors
  • 3. Skewed data values in the end target systems

We can mitigate this unknown factor by performing some simple structural validation up front, to determine whether the file contains the expected data and is in the correct format. This post describes how to develop a simple Talend job to perform this validation and remove headaches down the line.

Above is a screenshot of how our job will look once completed. Using only 4 components, we can design the process to read in thousands of files and validate every single row in them against a predefined format.

The Job Breakdown:

listAllCSVFilesInDir (tFileList) – This component allows us to specify an incoming/landing directory on the file system and collect files based on one or more file masks. We need to specify the:

i. Directory – Where the files will land on the local file system for the process

ii. FileList Type – Set to Files, though it can be used to locate directories too if required

iii. Files Filemask – The file mask used to filter which files are read from the directory. Simple wildcard patterns cover common scenarios, such as only CSV files (*.csv, CustomerFile_*.csv); more complex regex patterns are possible if you untick “Use Glob Expressions as Filemask”
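Outside Talend, this file-collection step can be sketched in a few lines of Python. The directory and masks below are illustrative placeholders, not values from the job:

```python
from pathlib import Path

def list_csv_files(directory, masks=("*.csv",)):
    """Collect files in a landing directory matching one or more glob masks."""
    found = []
    for mask in masks:
        # Path.glob applies the wildcard mask; keep files only, as with
        # "FileList Type = Files" in tFileList
        found.extend(p for p in Path(directory).glob(mask) if p.is_file())
    # de-duplicate (a file can match several masks) and sort for stable order
    return sorted(set(found))
```

Each path returned would then drive one iteration of the rest of the process, much like tFileList's iterate link.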

feralFile(s) (tFileInputDelimited) – By iterating over the files found by the tFileList component, we can read each file row by row using this component.

  • i. File name/Stream – the file to read. By specifying the tFileList’s CURRENT_FILEPATH variable, we guarantee each iteration reads the correct file in discrete execution threads with no data clashes
  • ii. Row Separator – how the rows end. Depending on your files this could change, but for this example it is a simple line feed, represented by “\n” (files created on Windows often end rows with a carriage return + line feed, CRLF, or “\r\n”). If you are unsure, refer to where the file is created to see what is used
  • iii. Field Separator – what the file is delimited by; some typical examples are [ ,  ;  :  | ‘ ‘ ]. Again, refer to where the file is created if unsure.
  • iv. Header, Footer, and Limit – Here you can specify how many rows to skip for the header and footer, or limit how many rows the process reads in. We will leave these as the defaults of 0, 0, and blank to read in a file with no header, footer, or limit.
  • v. Schema – Here we need to specify the expected columns as Strings with no length, so the process can always read them in (see the second screenshot below).

Note: If you open the advanced settings on the iterate connection, you can tick “Enable parallel execution” to process multiple files at the same time and reduce execution times. Here I’ve set it to 3.
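The per-row reading, with every field kept as a String so nothing fails on load, might look like this in Python. Here csv.reader stands in for tFileInputDelimited, and the comma separator and zero header rows are just this example's settings:

```python
import csv

def read_rows(path, field_sep=",", header=0):
    """Read a delimited file row by row; every field stays a plain string."""
    with open(path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f, delimiter=field_sep)):
            if i >= header:   # skip any header rows (default 0, as in the job)
                yield row
```

Reading everything as strings mirrors the all-String schema above: the file always loads, and type problems are caught by the validation step instead of crashing the read.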

validateStructure (tSchemaComplianceCheck) – This is the most important component of our feral file handling process: it checks the incoming data against a predefined schema and produces errors where rows do not comply. These rejects we can then route and handle as required.

i. Mode –

a. “Check all columns from schema” – compares against the component’s incoming schema, but since we deliberately read everything in as Strings this won’t catch the errors we are trying to avoid

b. “Custom defined” to manually specify what each column should be checked for

c. “Use another schema for compliance check” – lets us define a separate schema describing the values we expect

We will select option c. and define the expected schema, letting us specify the datatype, nullability (TRUE or FALSE), and metadata (date pattern, length, and precision). See the attached screenshots for an example.
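To make the compliance check concrete, here is a minimal Python sketch of the same idea: each string field is tested against an expected type, a nullability flag, and metadata. The schema below is a made-up example for illustration, not the one from the screenshots:

```python
from datetime import datetime

# Expected schema: (name, type, nullable, metadata) -- illustrative only.
# Metadata is a max length for strings, or a date pattern for dates.
SCHEMA = [
    ("customer_id", "int", False, None),
    ("name", "string", False, 50),
    ("signup_date", "date", True, "%d/%m/%Y"),
]

def check_row(row, schema=SCHEMA):
    """Return (is_valid, error_message) for one row of string fields."""
    if len(row) != len(schema):
        return False, f"expected {len(schema)} columns, got {len(row)}"
    for value, (name, dtype, nullable, meta) in zip(row, schema):
        if value == "":
            if not nullable:
                return False, f"{name}: null not allowed"
            continue
        if dtype == "int":
            if not value.lstrip("-").isdigit():
                return False, f"{name}: '{value}' is not an integer"
        elif dtype == "date":
            try:
                datetime.strptime(value, meta)
            except ValueError:
                return False, f"{name}: '{value}' does not match {meta}"
        elif dtype == "string" and meta is not None and len(value) > meta:
            return False, f"{name}: longer than {meta} characters"
    return True, None
```

Rows that come back invalid correspond to the Rejects flow in the Talend job.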

writeErrorLog (tFileOutputDelimited) – Finally we need to do something with the rejected rows. In this example we are going to write them out to an error log file, but there’s no reason this couldn’t go to a database or be manipulated/enriched further. The tSchemaComplianceCheck component needs to be connected to this component via a “Rejects” row, which automatically includes the columns “errorCode” and “errorMessage”. Here we just need to specify where the file will be created and, again, the separators:

i. File Name – Here I’ve used the directory the tFileList was searching within, prepended “ERROR_” to the filename, and collected the current iteration’s filename from the tFileList with CURRENT_FILE again.

ii. Row/Field Separator – This is just down to personal preference as long as the values won’t exist within the data.
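Sketching the reject handling the same way: each reject carries its row plus an error code and message, and lands in an ERROR_-prefixed file next to the source. The (row, code, message) tuple shape is an assumption for illustration; in Talend the Rejects row supplies errorCode and errorMessage automatically:

```python
import csv
import os

def write_error_log(source_path, rejects, field_sep=","):
    """Write rejected rows to ERROR_<filename> alongside the source file.

    Each reject is a (row, error_code, error_message) tuple.
    """
    directory, name = os.path.split(source_path)
    error_path = os.path.join(directory, "ERROR_" + name)
    with open(error_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=field_sep)
        for row, code, message in rejects:
            # append the error columns to the original row, as the
            # Rejects flow does with errorCode/errorMessage
            writer.writerow(list(row) + [code, message])
    return error_path
```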

The process is now complete! It will collect files from a directory, read them in row by row, and validate that each row matches the expected schema. Any invalid rows will be written to a new file named, e.g., ERROR_fileName.csv.

Further Options

This is just an example of how a truly simple validation step at the start of an ETL process can remove headaches further down the line. There is no reason it cannot be developed further, such as sending the error log via email (tSendMail) or copying valid or error files to another directory (tFileCopy). See the screenshot below for this.
