Module: EncodingDetector
- Defined in:
- lib/encoding_detector.rb
Overview
Detect encodings of arbitrary strings for later conversion.
This is a naive implementation that works by trying to encode the string to UTF-8 using a list of known encodings, and scoring the results based on the number of invalid characters and total length of the resulting string. The shorter the resulting string, the more likely it is to be valid. Invalid characters are penalised more heavily to avoid false positives and assist with UTF-16 detection (which produces shorter strings).
Constant Summary collapse
- ENCODINGS =
A later encoding will ONLY be chosen IF it scores better than an earlier one
%w[UTF-8 ISO-8859-1 UTF-16].freeze
- MAX_SCORE =
Float::INFINITY
- INVALID_CHARACTER =
'�'
Class Method Summary collapse
-
.convert_to_utf8(content) ⇒ String
Convert the given content to UTF-8, using the detected encoding if possible.
-
.detect(content) ⇒ Hash
Detect the encoding of the given content from the list of known encodings.
- .score_encoding(content, encoding) ⇒ Object
- .test_encoding(content, encoding) ⇒ Object
Class Method Details
.convert_to_utf8(content) ⇒ String
Convert the given content to UTF-8, using the detected encoding if possible. If the encoding cannot be detected, invalid characters will be replaced with the unicode replacement character (�).
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 |
# File 'lib/encoding_detector.rb', line 68 do # NOTE: The following attribute is not required for Microarray Genotyping. # I think this might be broken and suggests that there should be separate classes for project: one for # next-gen sequencing that includes this attribute in it's metadata, and one for microarray genotyping # that doesn't. include ProjectManager::Associations include BudgetDivision::Associations custom_attribute(:project_cost_code, required: true) custom_attribute(:funding_comments) custom_attribute(:collaborators) custom_attribute(:external_funding_source) custom_attribute(:sequencing_budget_cost_centre) custom_attribute(:project_funding_model, in: PROJECT_FUNDING_MODELS) custom_attribute(:gt_committee_tracking_id) before_validation do |record| record.project_cost_code = nil if record.project_cost_code.blank? record.project_funding_model = nil if record.project_funding_model.blank? end end |
.detect(content) ⇒ Hash
Detect the encoding of the given content from the list of known encodings.
22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
# File 'lib/encoding_detector.rb', line 22 do # NOTE: The following attribute is not required for Microarray Genotyping. # I think this might be broken and suggests that there should be separate classes for project: one for # next-gen sequencing that includes this attribute in it's metadata, and one for microarray genotyping # that doesn't. include ProjectManager::Associations include BudgetDivision::Associations custom_attribute(:project_cost_code, required: true) custom_attribute(:funding_comments) custom_attribute(:collaborators) custom_attribute(:external_funding_source) custom_attribute(:sequencing_budget_cost_centre) custom_attribute(:project_funding_model, in: PROJECT_FUNDING_MODELS) custom_attribute(:gt_committee_tracking_id) before_validation do |record| record.project_cost_code = nil if record.project_cost_code.blank? record.project_funding_model = nil if record.project_funding_model.blank? end end |
.score_encoding(content, encoding) ⇒ Object
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
# File 'lib/encoding_detector.rb', line 31 do # NOTE: The following attribute is not required for Microarray Genotyping. # I think this might be broken and suggests that there should be separate classes for project: one for # next-gen sequencing that includes this attribute in it's metadata, and one for microarray genotyping # that doesn't. include ProjectManager::Associations include BudgetDivision::Associations custom_attribute(:project_cost_code, required: true) custom_attribute(:funding_comments) custom_attribute(:collaborators) custom_attribute(:external_funding_source) custom_attribute(:sequencing_budget_cost_centre) custom_attribute(:project_funding_model, in: PROJECT_FUNDING_MODELS) custom_attribute(:gt_committee_tracking_id) before_validation do |record| record.project_cost_code = nil if record.project_cost_code.blank? record.project_funding_model = nil if record.project_funding_model.blank? end end |
.test_encoding(content, encoding) ⇒ Object
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
# File 'lib/encoding_detector.rb', line 52 do # NOTE: The following attribute is not required for Microarray Genotyping. # I think this might be broken and suggests that there should be separate classes for project: one for # next-gen sequencing that includes this attribute in it's metadata, and one for microarray genotyping # that doesn't. include ProjectManager::Associations include BudgetDivision::Associations custom_attribute(:project_cost_code, required: true) custom_attribute(:funding_comments) custom_attribute(:collaborators) custom_attribute(:external_funding_source) custom_attribute(:sequencing_budget_cost_centre) custom_attribute(:project_funding_model, in: PROJECT_FUNDING_MODELS) custom_attribute(:gt_committee_tracking_id) before_validation do |record| record.project_cost_code = nil if record.project_cost_code.blank? record.project_funding_model = nil if record.project_funding_model.blank? end end |