Thoughts, information and reflections about technology

Validating CSV imports in Rails

I needed to validate CSV files that were being imported into a Rails app. The validation was simple: Look for blank entries, ensure columns used for associations (IDs) were integers and check that certain other entries were numeric.

The first thing I checked was to make sure that the header of the CSV file matched what we expected. The simplest way to do this was to set an ‘expected header’ and then check the header on the CSV.

Here is a stripped down example for a hypothetical employee list. In the controller, I would  have :

 expected_header = [“EmployeeNo”, “First”, “Last”, “Bonus”]
 csv_error = true if !helpers.check_header(expected_header,csv_file)

We are using Rails 5 so we can use helpers. to call a method in a helper for the controller.
In the helper, we put

def check_header(expected_header,csv_file)
header = CSV.open(csv_file, ‘r’) { |csv| csv.first }
valid_csv = true
for i in (0..header.size-1)
if header[i].downcase != expected_header[i].downcase
valid_csv = false
end
end

if !valid_csv
$csv_error = “Header:
#{header}
Expected Header:
#{expected_header} ”
end
return valid_csv
end

A couple of notes on the text below
  • We have already set csv_file in the controller.
  • We are grabbing the first line of the csv which should be the header.
  • We walk through each header.  .downcase is used because there were some differences in case from different sources.
  • we compare the expected header with the header and return valid_csv if it matches
The next lines in the controller are as follows. We are checking to see if we got an error checking the header. If not, then we walk through the data.

            if !csv_error
CSV.foreach(csv_file, :headers => true) do |row|
csv_error = true if !helpers.check_employee_csv(row)
end
end  

The method called in the helper is something like

 

def check_lists_csv(row)
ok = true
ok = false unless data_is_present(row, 0, “EmployeeID)
ok = false unless data_is_integer(row, 0, “EmployeeID)
ok = false unless data_is_present(row, 1, “First”)
ok = false unless data_is_present(row, 2, “Last”)
ok = false unless data_is_present(row, 3, “Bonus)
ok = false unless_data_is_numeric(row, 3, “Bonus”)
return ok
end

Notes on above
  • row is passed from the controller and contains the line of data from the csv
  • the second parameter is the index of the column. The column numbers begin with 0
  • the third parameter is the name of the column for the error message.
Here are the methods that check for blank, integer and numeric in the helper

def data_is_numeric(row, column_number, column_name)
data_numeric = true
if row[column_number.to_i].to_s.match(/A[+-]?d+?(.d+)?Z/) == nil
$csv_error = $csv_error + “Error in row #{$INPUT_LINE_NUMBER} >> #{row} >>  #{column_name} is not numeric

data_numeric = true
end
return data_numeric
end

 

def data_is_integer(row, column_number, column_name)
data_integer = true
if row[column_number.to_i].to_s.match(/^(0|[1-9][0-9]*)$/) == nil
$csv_error = $csv_error + “Error in row #{$INPUT_LINE_NUMBER} >> #{row} >>  #{column_name} is not an integer

data_integer = true
end
return data_integer
end

 

def data_is_present(row, column_number, column_name)
data_present = true
if row[column_number.to_i].blank?
$csv_error = $csv_error + “Error in row #{$INPUT_LINE_NUMBER} >> #{row} >>  #{column_name} is blank

data_present = false
end
return data_present
end

If the csv validations pass, then we process the csv
           if !csv_error
              CSV.foreach(csv_file, :headers => true) do |row|
                update_focus(row)
              end
              file = “Employee”
            else
              redirect_to file_error_path and return
              $csv_error = $csv_error + “Your CSV was NOT uploaded  Please fix the error(s) and retry”
            end
NOTE – I know that there are problems with global variables but they work for the app.
The file error view is simply

 

There was an error in the file upload
The file you were attempting to upload was not replaced

<%=  $csv_error.html_safe %>


This works and requires about 3 seconds to process about 40k of records.

My first attempt was taking about 8 minutes to process 40k of records. The problem was that I was trying to address variations in case with something like

 

ok = false unless data_is_present(row, “EmployeeID”) || data_is_present(row, “Employeeid”)

We had CSVs with both versions of the header. For some reason, that was taking a lot of time to process. By using the index number of the header, it avoided whatever was causing the slow processing.

Similar Posts:

Leave a Reply

Your email address will not be published. Required fields are marked *

Contact me
Archives
Categories
Amazon Disclosure

We are a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for us to earn fees by linking to Amazon.com and affiliated sites.