mwparserfromhell Package

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki wikicode.

definitions Module

Contains data about certain markup, like HTML tags and external links.

When updating this file, please also update the C tokenizer version:

- mwparserfromhell/parser/ctokenizer/definitions.c
- mwparserfromhell/parser/ctokenizer/definitions.h

mwparserfromhell.definitions.get_html_tag(markup)[source]

Return the HTML tag associated with the given wiki-markup.

mwparserfromhell.definitions.is_parsable(tag)[source]

Return whether the given tag’s contents should be passed to the parser.

mwparserfromhell.definitions.is_scheme(scheme, slashes=True)[source]

Return whether scheme is valid for external links.

mwparserfromhell.definitions.is_single(tag)[source]

Return whether or not the given tag can exist without a close tag.

mwparserfromhell.definitions.is_single_only(tag)[source]

Return whether or not the given tag must exist without a close tag.

mwparserfromhell.definitions.is_visible(tag)[source]

Return whether or not the given tag contains visible text.

string_mixin Module

This module contains the StringMixIn type, which implements the interface for the str type in a dynamic manner.

class mwparserfromhell.string_mixin.StringMixIn[source]

Implement the interface for str in a dynamic manner.

To use this class, inherit from it and override the __str__() method to return the string representation of the object. The various string methods will operate on the value of __str__() instead of the immutable self like the regular str type.

maketrans()

Return a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters to Unicode ordinals, strings or None. Character keys will be then converted to ordinals. If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.
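The description matches the built-in str.maketrans. Assuming the mixin simply exposes that function, the three call forms can be sketched with plain str:

```python
# One argument: a dict mapping characters (or ordinals) to replacements or None.
table_one = str.maketrans({"a": "1", "b": None})
result_one = "abc".translate(table_one)

# Two arguments: equal-length strings, mapped position by position.
table_two = str.maketrans("abc", "xyz")
result_two = "aabbcc".translate(table_two)

# Three arguments: characters of the third string are mapped to None (deleted).
table_three = str.maketrans("abc", "xyz", "-")
result_three = "a-b-c".translate(table_three)
```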

utils Module

This module contains accessory functions for other parts of the library. Parser users generally won’t need stuff from here.

mwparserfromhell.utils.parse_anything(value: Any, context: int = 0, *, skip_style_tags: bool = False) -> Wikicode[source]

Return a Wikicode for value, allowing multiple types.

This differs from Parser.parse() in that we accept more than just a string to be parsed. Strings, bytes, integers (converted to strings), None, existing Node or Wikicode objects, as well as an iterable of these types, are supported. This is used to parse input on-the-fly by various methods of Wikicode and others like Template, such as wikicode.insert() or setting template.name.

Additional arguments are passed directly to Parser.parse().

wikicode Module

class mwparserfromhell.wikicode.Wikicode(nodes)[source]

Bases: StringMixIn

A Wikicode is a container for nodes that operates like a string.

Additionally, it contains methods that can be used to extract data from or modify the nodes, implemented in an interface similar to a list. For example, index() can get the index of a node in the list, and insert() can add a new node at that index. The filter() series of functions is very useful for extracting and iterating over, for example, all of the templates in the object.

RECURSE_OTHERS = 2

append(value)[source]

Insert value at the end of the list of nodes.

value can be anything parsable by parse_anything().

contains(obj)[source]

Return whether this Wikicode object contains obj.

If obj is a Node or Wikicode object, then we search for it exactly among all of our children, recursively. Otherwise, this method just uses __contains__() on the string.

filter(*args, **kwargs)[source]

Return a list of nodes within our list matching certain conditions.

This is equivalent to calling list() on ifilter().

filter_arguments(*a, **kw)

Iterate over arguments.

This is equivalent to filter() with forcetype set to Argument.

filter_comments(*a, **kw)

Iterate over comments.

This is equivalent to filter() with forcetype set to Comment.

filter_external_links(*a, **kw)

Iterate over external links.

This is equivalent to filter() with forcetype set to ExternalLink.

filter_headings(*a, **kw)

Iterate over headings.

This is equivalent to filter() with forcetype set to Heading.

filter_html_entities(*a, **kw)

Iterate over HTML entities.

This is equivalent to filter() with forcetype set to HTMLEntity.

filter_tags(*a, **kw)

Iterate over tags.

This is equivalent to filter() with forcetype set to Tag.

filter_templates(*a, **kw)

Iterate over templates.

This is equivalent to filter() with forcetype set to Template.

filter_text(*a, **kw)

Iterate over text.

This is equivalent to filter() with forcetype set to Text.

filter_wikilinks(*a, **kw)

Iterate over wikilinks.

This is equivalent to filter() with forcetype set to Wikilink.

get(index)[source]

Return the node at position index within the list of nodes.

get_ancestors(obj)[source]

Return a list of all ancestor nodes of the Node obj.

The list is ordered from the most shallow ancestor (greatest great-grandparent) to the direct parent. The node itself is not included in the list. For example:

>>> text = "{{a|{{b|{{c|{{d}}}}}}}}"
>>> code = mwparserfromhell.parse(text)
>>> node = code.filter_templates(matches=lambda n: n == "{{d}}")[0]
>>> code.get_ancestors(node)
['{{a|{{b|{{c|{{d}}}}}}}}', '{{b|{{c|{{d}}}}}}', '{{c|{{d}}}}']

Will return an empty list if obj is at the top level of this Wikicode object. Will raise ValueError if it wasn’t found.

get_parent(obj)[source]

Return the direct parent node of the Node obj.

This function is equivalent to calling get_ancestors() and taking the last element of the resulting list. Will return None if the node exists but does not have a parent; i.e., it is at the top level of the Wikicode object.

get_sections(levels=None, matches=None, flags=RegexFlag.IGNORECASE | UNICODE | DOTALL, flat=False, include_lead=None, include_headings=True)[source]

Return a list of sections within the page.

Sections are returned as Wikicode objects with a shared node list (implemented using SmartList) so that changes to sections are reflected in the parent Wikicode object.

Each section contains all of its subsections, unless flat is True. If levels is given, it should be an iterable of integers; only sections whose heading levels are within it will be returned. If matches is given, it should be either a function or a regex; only sections whose headings match it (without the surrounding equal signs) will be included. flags can be used to override the default regex flags (see ifilter()) if a regex matches is used.

If include_lead is True, the first, lead section (without a heading) will be included in the list; False will not include it; the default will include it only if no specific levels were given. If include_headings is True, the section’s beginning Heading object will be included; otherwise, this is skipped.

get_tree()[source]

Return a hierarchical tree representation of the object.

The representation is a string that makes the most sense when printed. It is built by calling _get_tree() on the Wikicode object and its children recursively. The end result may look something like the following:

>>> text = "Lorem ipsum {{foo|bar|{{baz}}|spam=eggs}}"
>>> print(mwparserfromhell.parse(text).get_tree())
Lorem ipsum
{{
      foo
    | 1
    = bar
    | 2
    = {{
            baz
      }}
    | spam
    = eggs
}}

ifilter(recursive=True, matches=None, flags=RegexFlag.IGNORECASE | UNICODE | DOTALL, forcetype=None)[source]

Iterate over nodes in our list matching certain conditions.

If forcetype is given, only nodes that are instances of this type (or tuple of types) are yielded. Setting recursive to True will iterate over all children and their descendants. RECURSE_OTHERS will only iterate over children that are not instances of forcetype. False will only iterate over immediate children.

RECURSE_OTHERS can be used to iterate over all un-nested templates, even if they are inside of HTML tags, like so:

>>> code = mwparserfromhell.parse("{{foo}}<b>{{foo|{{bar}}}}</b>")
>>> code.filter_templates(code.RECURSE_OTHERS)
['{{foo}}', '{{foo|{{bar}}}}']

matches can be used to further restrict the nodes, either as a function (taking a single Node and returning a boolean) or a regular expression (matched against the node’s string representation with re.search()). If matches is a regex, the flags passed to re.search() are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags.

ifilter_arguments(*a, **kw)

Iterate over arguments.

This is equivalent to ifilter() with forcetype set to Argument.

ifilter_comments(*a, **kw)

Iterate over comments.

This is equivalent to ifilter() with forcetype set to Comment.

ifilter_external_links(*a, **kw)

Iterate over external links.

This is equivalent to ifilter() with forcetype set to ExternalLink.

ifilter_headings(*a, **kw)

Iterate over headings.

This is equivalent to ifilter() with forcetype set to Heading.

ifilter_html_entities(*a, **kw)

Iterate over HTML entities.

This is equivalent to ifilter() with forcetype set to HTMLEntity.

ifilter_tags(*a, **kw)

Iterate over tags.

This is equivalent to ifilter() with forcetype set to Tag.

ifilter_templates(*a, **kw)

Iterate over templates.

This is equivalent to ifilter() with forcetype set to Template.

ifilter_text(*a, **kw)

Iterate over text.

This is equivalent to ifilter() with forcetype set to Text.

ifilter_wikilinks(*a, **kw)

Iterate over wikilinks.

This is equivalent to ifilter() with forcetype set to Wikilink.

index(obj, recursive=False)[source]

Return the index of obj in the list of nodes.

Raises ValueError if obj is not found. If recursive is True, we will look in all nodes of ours and their descendants, and return the index of our direct descendant node within our list of nodes. Otherwise, the lookup is done only on direct descendants.

insert(index, value)[source]

Insert value at index in the list of nodes.

value can be anything parsable by parse_anything(), which includes strings or other Wikicode or Node objects.

insert_after(obj, value, recursive=True)[source]

Insert value immediately after obj.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

insert_before(obj, value, recursive=True)[source]

Insert value immediately before obj.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

matches(other)[source]

Do a loose equivalency test suitable for comparing page names.

other can be any string-like object, including Wikicode, or an iterable of these. This operation is symmetric; both sides are adjusted. Specifically, whitespace and markup are stripped and the first letter’s case is normalized. Typical usage is if template.name.matches("stub"): ....

property nodes

A list of Node objects.

This is the internal data actually stored within a Wikicode object.

remove(obj, recursive=True)[source]

Remove obj from the list of nodes.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

replace(obj, value, recursive=True)[source]

Replace obj with value.

obj can be either a string, a Node, or another Wikicode object (as created by get_sections(), for example). If obj is a string, we will operate on all instances of that string within the code, otherwise only on the specific instance given. value can be anything parsable by parse_anything(). If recursive is True, we will try to find obj within our child nodes even if it is not a direct descendant of this Wikicode object. If obj is not found, ValueError is raised.

set(index, value)[source]

Set the Node at index to value.

Raises IndexError if index is out of range, or ValueError if value cannot be coerced into one Node. To insert multiple nodes at an index, use get() with either remove() and insert() or replace().

strip_code(normalize=True, collapse=True, keep_template_params=False)[source]

Return a rendered string without unprintable code such as templates.

The way a node is stripped is handled by the __strip__() method of Node objects, which generally returns a subset of their nodes or None. For example, templates and tags are removed completely, links are stripped to just their display part, and headings are stripped to just their title.

If normalize is True, various things may be done to strip code further, such as converting HTML entities like &Sigma;, &#931;, and &#x3a3; to Σ. If collapse is True, we will try to remove excess whitespace as well (three or more newlines are converted to two, for example). If keep_template_params is True, then template parameters will be preserved in the output (normally, they are removed completely).

Subpackages