raspador
The raspador module provides a generic framework for extracting data from semi-structured text files.
Parser
class raspador.parser.ParserMetaclass(name, bases, attrs)
Collects data extractors into a field collection and injects ParserMixin.
Fields
Fields define what data is extracted and how. The parser does not require fields to inherit explicitly from BaseField; the minimum requirement is that a field provides a parse_block method.
The fields in this file are based on regular expressions and provide conversion for primitive types in Python.
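The minimum contract described above can be illustrated with a field that does not inherit from BaseField. This is only a sketch: the class name UpperCaseField, the sample text, and the pattern are hypothetical, and whether the Parser calls anything beyond parse_block is not covered here.

```python
import re


class UpperCaseField:
    """Hypothetical field: it does not inherit from BaseField."""

    def __init__(self, search):
        self.pattern = re.compile(search)

    def parse_block(self, block):
        # Return the first captured group in upper case, or None if no match.
        match = self.pattern.search(block)
        if match is None:
            return None
        return match.group(1).upper()


field = UpperCaseField(r'Operador:\s*(\w+)')
result = field.parse_block("Operador: maria")
print(result)
```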
class raspador.fields.BRFloatField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Removes the thousands separator and converts to float (Brazilian format).
to_python(value)
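As a rough illustration of the conversion described for BRFloatField, the sketch below assumes the Brazilian convention of "." as the thousands separator and "," as the decimal mark. The helper br_to_float is hypothetical and is not the library's to_python implementation.

```python
def br_to_float(value):
    """Hypothetical helper, not the library's to_python."""
    # '1.234,56' -> '1234,56' -> '1234.56' -> 1234.56
    return float(value.replace('.', '').replace(',', '.'))


total = br_to_float('1.234,56')
small = br_to_float('0,99')
print(total, small)
```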
class raspador.fields.BaseField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Contains the processing logic to extract data using regular expressions, and provides utility methods that can be overridden for custom data processing.
Default behavior can be adjusted by parameters:
search
A regular expression that must contain at least one capturing group. Use parentheses to capture:
>>> s = "02/01/2013 10:21:51 COO:022734"
>>> field = BaseField(search=r'COO:(\d+)')
>>> field.parse_block(s)
'022734'
search is the only positional parameter, so its name can be omitted:
>>> s = "02/01/2013 10:21:51 COO:022734"
>>> field = BaseField(r'COO:(\d+)')
>>> field.parse_block(s)
'022734'
input_processor
A function applied to the captured value before the field returns it.
>>> s = "02/01/2013 10:21:51 COO:022734"
>>> def double(value):
...     return int(value) * 2
...
>>> field = BaseField(r'COO:(\d+)', input_processor=double)
>>> field.parse_block(s)  # 45468 = 2 x 22734
45468
groups
Specifies which numbered capturing groups to process.
You can pass an integer as the group index:
>>> s = "Contador de Reduções Z: 1246"
>>> field = BaseField(r'Contador de Reduç(ão|ões) Z:\s*(\d+)', groups=1, input_processor=int)
>>> field.parse_block(s)
1246
Or a list of integers:
>>> s = "Data do movimento: 02/01/2013 10:21:51"
>>> c = BaseField(r'^Data .*(movimento|cupom): (\d+)/(\d+)/(\d+)', groups=[1, 2, 3])
>>> c.parse_block(s)
['02', '01', '2013']
Note
If you do not need a group to capture its match, you can optimize the regular expression by putting ?: after the opening parenthesis:
>>> s = "Contador de Reduções Z: 1246"
>>> field = BaseField(r'Contador de Reduç(?:ão|ões) Z:\s*(\d+)')
>>> field.parse_block(s)
'1246'
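The effect of ?: can be checked with the standard re module alone, independent of raspador. The sketch below compares the groups produced by the two patterns used in the examples above. The examples also suggest that groups indexes this tuple from zero, but that is an inference from them, not something stated here.

```python
import re

s = "Contador de Reduções Z: 1246"

# With a capturing alternation, the match carries two groups.
capturing = re.search(r'Contador de Reduç(ão|ões) Z:\s*(\d+)', s)

# With (?:...), the alternation still matches but is not captured.
non_capturing = re.search(r'Contador de Reduç(?:ão|ões) Z:\s*(\d+)', s)

print(capturing.groups())      # ('ões', '1246')
print(non_capturing.groups())  # ('1246',)
```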
default
If assigned, the Parser will use this default if no value was returned by the field.
is_list
When specified, returns the value as a list:
>>> s = "02/01/2013 10:21:51 COO:022734"
>>> field = BaseField(r'COO:(\d+)', is_list=True)
>>> field.parse_block(s)
['022734']
By convention, when a field returns a list, the Parser accumulates the values returned by the field.
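The accumulation convention can be pictured with a plain loop over several blocks. This sketch uses only the standard re module and an ordinary list; how the real Parser stores and merges values is not shown in this page, so the loop, the variable names, and the sample blocks are assumptions made for illustration.

```python
import re

pattern = re.compile(r'COO:(\d+)')
blocks = [
    "02/01/2013 10:21:51 COO:022734",
    "linha sem contador",
    "02/01/2013 10:25:03 COO:022735",
]

accumulated = []
for block in blocks:
    match = pattern.search(block)
    if match:
        # An is_list field would return a one-element list for this block.
        value = [match.group(1)]
        # Accumulating means extending the running list, not replacing it.
        accumulated.extend(value)

print(accumulated)
```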
assign_class(cls, name)
assign_parser(parser)
Receives a weak reference to the Parser.
parse_block(block)
search
to_python(value)
Converts parsed data to a native Python type.
class raspador.fields.BooleanField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Returns True if the block is matched by the regular expression and at least one value is captured.
to_python(value)
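One possible reading of the BooleanField rule is sketched below with the standard re module. The function matches_with_capture is hypothetical, and treating an empty or missing group as "nothing captured" is an assumption about the intended behavior, not the library's code.

```python
import re


def matches_with_capture(search, block):
    """Hypothetical helper illustrating the rule; not the library's code."""
    match = re.search(search, block)
    # True only when the pattern matched and some group captured a
    # non-empty value (unmatched optional groups are None).
    return bool(match) and any(match.groups())


cancelled = matches_with_capture(r'(CANCELADO)', "CUPOM CANCELADO")
absent = matches_with_capture(r'(CANCELADO)', "CUPOM FISCAL")
empty_group = matches_with_capture(r'CANCELADO(X)?', "CUPOM CANCELADO")
print(cancelled, absent, empty_group)
```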
class raspador.fields.DateField(search=None, formato=None, **kwargs)
Field that holds data in date format, represented in Python by datetime.date.
convertion_function(date)
default_format_string = '%d/%m/%Y'
to_python(value)
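The default format string '%d/%m/%Y' is a standard strptime pattern (day/month/year). A minimal sketch of the conversion it implies, using only the datetime module, follows. The helper parse_br_date is hypothetical and is not the field's to_python.

```python
from datetime import datetime


def parse_br_date(value, format_string='%d/%m/%Y'):
    """Hypothetical helper: parse a day/month/year string into a date."""
    return datetime.strptime(value, format_string).date()


d = parse_br_date('02/01/2013')
print(d.isoformat())  # 2013-01-02 (2 January, not 1 February)
```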
class raspador.fields.DateTimeField(search=None, formato=None, **kwargs)
Field that holds data in date and time format, represented in Python by datetime.datetime.
convertion_function(date)
default_format_string = '%d/%m/%Y %H:%M:%S'
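To show how a captured timestamp and this default format string fit together, the sketch below extracts the leading timestamp from the sample line used earlier and converts it with datetime.strptime. The regular expression is an assumption chosen for the sample and is not taken from the library.

```python
import re
from datetime import datetime

s = "02/01/2013 10:21:51 COO:022734"

# Capture the leading "dd/mm/yyyy hh:mm:ss" portion of the line.
match = re.search(r'^(\d+/\d+/\d+ \d+:\d+:\d+)', s)

# Convert the captured text using the documented default format string.
moment = datetime.strptime(match.group(1), '%d/%m/%Y %H:%M:%S')
print(moment.isoformat())  # 2013-01-02T10:21:51
```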
class raspador.fields.FloatField(search=None, default=None, is_list=False, input_processor=None, groups=[])
Removes the thousands separator and converts to float.
to_python(value)
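The page does not say which character FloatField treats as the thousands separator. The sketch below assumes "," with "." as the decimal point, and the helper to_float is hypothetical.

```python
def to_float(value, thousands_sep=','):
    """Hypothetical helper; the separator is an assumption."""
    # '1,234.56' -> '1234.56' -> 1234.56
    return float(value.replace(thousands_sep, ''))


amount = to_float('1,234.56')
print(amount)
```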
-