raspador
Library to extract data from semi-structured text documents.
It is best suited to processing plain-text files (or files that are easy to convert to plain text) that have no formal structure.
Parser
class raspador.parser.ParserMetaclass(name, bases, attrs)
Collects data extractors into a field collection and injects ParserMixin.
Fields
Fields define what data will be extracted and how. The parser does not require fields to inherit explicitly from BaseField; the minimum requirement is that a field provides a parse_block method.
The fields in this file are based on regular expressions and provide conversion to Python primitive types.
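The minimum interface described above can be illustrated with a small, self-contained sketch. KeywordField is a hypothetical name, not part of raspador, and returning None when nothing matches is an assumption about what a parser would treat as "no value".

```python
import re


class KeywordField:
    """Hypothetical field: it does not inherit from BaseField and only
    implements parse_block, the minimum interface described above."""

    def __init__(self, keyword):
        # Match the keyword as a whole word, ignoring case.
        self._pattern = re.compile(r'\b%s\b' % re.escape(keyword), re.IGNORECASE)

    def parse_block(self, block):
        # Assumption: returning None signals that the block had no match.
        match = self._pattern.search(block)
        return match.group(0) if match else None


field = KeywordField('total')
print(field.parse_block('TOTAL R$ 10,00'))    # TOTAL
print(field.parse_block('02/01/2013 10:21'))  # None
```

Any object shaped like this could, in principle, stand in for a field, since only parse_block is expected.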
class raspador.fields.BRFloatField(search, thousand_separator=None, decimal_separator=None, **kwargs)
Removes the thousand separator and converts to float (Brazilian format).
Deprecated since version 0.2.2: Use FloatField instead.
default_decimal_separator = ','
default_thousand_separator = '.'
class raspador.fields.BaseField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Contains the processing logic to extract data using regular expressions, and provides utility methods that can be overridden for custom data processing.
Default behavior can be adjusted through the following parameters:
search
Regular expression that must specify a capturing group. Use parentheses to capture:
    >>> s = "02/01/2013 10:21:51 COO:022734"
    >>> field = BaseField(search=r'COO:(\d+)')
    >>> field.parse_block(s)
    '022734'
The search parameter is the only positional parameter, so its name can be omitted:
    >>> s = "02/01/2013 10:21:51 COO:022734"
    >>> field = BaseField(r'COO:(\d+)')
    >>> field.parse_block(s)
    '022734'
input_processor
Receives a function that handles the captured value before the field returns it.
    >>> s = "02/01/2013 10:21:51 COO:022734"
    >>> def double(value):
    ...     return int(value) * 2
    ...
    >>> field = BaseField(r'COO:(\d+)', input_processor=double)
    >>> field.parse_block(s)  # 45468 = 2 x 22734
    45468
groups
Specifies which numbered capturing groups to process.
You can pass an integer as the group index (in the examples here, indexes count capturing groups from zero):
    >>> s = "Contador de Reduções Z: 1246"
    >>> regex = r'Contador de Reduç(ão|ões) Z:\s*(\d+)'
    >>> field = BaseField(regex, groups=1, input_processor=int)
    >>> field.parse_block(s)
    1246
Or a list of integers:
    >>> s = "Data do movimento: 02/01/2013 10:21:51"
    >>> regex = r'^Data .*(movimento|cupom): (\d+)/(\d+)/(\d+)'
    >>> c = BaseField(regex, groups=[1, 2, 3])
    >>> c.parse_block(s)
    ['02', '01', '2013']
Note
If you do not need a group to capture its match, you can optimize the regular expression by putting ?: after the opening parenthesis:
    >>> s = "Contador de Reduções Z: 1246"
    >>> field = BaseField(r'Contador de Reduç(?:ão|ões) Z:\s*(\d+)')
    >>> field.parse_block(s)
    '1246'
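The group selection shown in the examples above can be approximated with the standard re module. This is only a sketch: select_groups is a hypothetical helper, not raspador's implementation, and the zero-based indexing over the captured groups is an assumption inferred from the outputs in those examples.

```python
import re


def select_groups(search, block, groups):
    """Hypothetical helper, not raspador's code: pick captured groups by
    zero-based index, which reproduces the outputs of the examples above."""
    match = re.search(search, block)
    if match is None:
        return None
    captured = match.groups()  # tuple of every capturing group, in order
    if isinstance(groups, int):
        return captured[groups]
    return [captured[index] for index in groups]


# Index 0 would be 'ões'; index 1 is the counter itself.
print(select_groups(r'Contador de Reduç(ão|ões) Z:\s*(\d+)',
                    'Contador de Reduções Z: 1246', 1))
# prints 1246

print(select_groups(r'^Data .*(movimento|cupom): (\d+)/(\d+)/(\d+)',
                    'Data do movimento: 02/01/2013 10:21:51', [1, 2, 3]))
# prints ['02', '01', '2013']
```

Note that this differs from re's own match.group(n), where group 1 is the first capturing group.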
default
If assigned, the Parser will query this default if no value was returned by the field.
is_list
When specified, returns the value as a list:
    >>> s = "02/01/2013 10:21:51 COO:022734"
    >>> field = BaseField(r'COO:(\d+)', is_list=True)
    >>> field.parse_block(s)
    ['022734']
By convention, when a field returns a list, the Parser accumulates the values returned by the field.
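The accumulation convention can be pictured with a short, self-contained sketch. The accumulate function is a hypothetical stand-in, not the Parser's code; it assumes that each line is one block and that a list-returning field yields a one-item list per matching block.

```python
import re


def accumulate(lines, search):
    """Hypothetical stand-in for the accumulation convention: every block
    that yields a list extends the values collected so far."""
    collected = []
    pattern = re.compile(search)
    for line in lines:
        match = pattern.search(line)
        # Mimic a field with is_list=True: a one-item list, or None.
        result = [match.group(1)] if match else None
        if isinstance(result, list):
            collected.extend(result)
    return collected


lines = [
    '02/01/2013 10:21:51 COO:022734',
    'no counter on this line',
    '02/01/2013 10:25:03 COO:022735',
]
print(accumulate(lines, r'COO:(\d+)'))  # ['022734', '022735']
```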
assign_class(cls, name)
assign_parser(parser)
Receives a weak reference to the Parser.
parse_block(block)
search
setup()
Hook for any special setup required by child classes.
to_python(value)
Converts parsed data to a native Python type.
class raspador.fields.BooleanField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Returns True if the block is matched by the regular expression and at least some value is captured.
setup()
to_python(value)
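One reading of the BooleanField description is sketched below with the standard re module. The matched_with_value helper is hypothetical and is not how raspador implements the field; it assumes the pattern has at least one capturing group and that an empty capture counts as no value.

```python
import re


def matched_with_value(search, block):
    """Hypothetical helper: True only when the pattern matches and the
    first capturing group holds a non-empty string.

    Assumes the pattern defines at least one capturing group."""
    match = re.search(search, block)
    return bool(match and match.group(1))


print(matched_with_value(r'(CANCELADO)', 'CUPOM FISCAL CANCELADO'))  # True
print(matched_with_value(r'(CANCELADO)', 'CUPOM FISCAL'))            # False
print(matched_with_value(r'COO:(\d*)', 'COO:'))                      # False
```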
class raspador.fields.DateField(search=None, format_string=None, **kwargs)
Field that holds data in date format, represented in Python by datetime.date.
convertion_function(date)
default_format_string = '%d/%m/%Y'
to_python(value)
class raspador.fields.DateTimeField(search=None, format_string=None, **kwargs)
Field that holds data in date and time format, represented in Python by datetime.datetime.
convertion_function(date)
default_format_string = '%d/%m/%Y %H:%M:%S'
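To show what the default format strings of DateField and DateTimeField mean, the sketch below parses a sample value with the standard datetime module. It does not call raspador; the constant names and the sample string are illustrative assumptions.

```python
from datetime import datetime

DATE_FORMAT = '%d/%m/%Y'               # DateField.default_format_string
DATETIME_FORMAT = '%d/%m/%Y %H:%M:%S'  # DateTimeField.default_format_string

captured = '02/01/2013 10:21:51'

# Day comes first: this is 2 January 2013, not 1 February.
as_datetime = datetime.strptime(captured, DATETIME_FORMAT)
as_date = datetime.strptime(captured[:10], DATE_FORMAT).date()

print(as_datetime)  # 2013-01-02 10:21:51
print(as_date)      # 2013-01-02
```

For documents that use another layout, the format_string parameter in the signatures above is presumably the place to supply a different pattern.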
class raspador.fields.FloatField(search, thousand_separator=None, decimal_separator=None, **kwargs)
Sanitizes the captured value according to the thousand and decimal separators and converts it to float.
default_decimal_separator = '.'
default_thousand_separator = ','
to_python(value)
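The sanitizing step that FloatField describes might look like the sketch below. The to_float function is a hypothetical helper, not raspador's code; its defaults mirror FloatField's default separators, and the second call uses the Brazilian separators listed for BRFloatField.

```python
def to_float(value, thousand_separator=',', decimal_separator='.'):
    """Hypothetical helper: drop the thousand separator, normalise the
    decimal separator to '.', then convert to float."""
    # Remove thousand separators first so they cannot be mistaken
    # for the decimal point after normalisation.
    cleaned = value.replace(thousand_separator, '')
    cleaned = cleaned.replace(decimal_separator, '.')
    return float(cleaned)


print(to_float('1,234.56'))            # 1234.56
print(to_float('1.234,56', '.', ','))  # 1234.56
```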