Use a pattern that matches the type of temporal value you expect. Be sure to consider issues such as how strict to be about delimiters between subparts and the lengths of the subparts.
Dates are a validation headache because they come in so many formats. Pattern tests are extremely useful for weeding out illegal values, but often insufficient for full verification: a date might have a number where you expect a month, but if the number is 13, the date isn’t valid. This section introduces some patterns that match a few common date formats. Performing Validity Checking on Date or Time Subparts revisits this topic in more detail and discusses how to combine pattern tests with content verification.
To require values to be dates in ISO (CCYY-MM-DD
) format, use
this pattern:
/^d{4}-d{2}-d{2}$/
The pattern requires the
-
character as the
delimiter between date parts. To allow either -
or /
as
the delimiter, use a character class between the numeric parts (the
slashes are escaped with a backslash to prevent them from being
interpreted as the end of the pattern constructor):
/^d{4}[-/]d{2}[-/]d{2}$/
Or you can use a different delimiter around the pattern and avoid the backslashes:
m|^d{4}[-/]d{2}[-/]d{2}$|
To allow any non-digit delimiter (which corresponds to how MySQL operates when it interprets strings as dates), use this pattern:
/^d{4}Dd{2}Dd{2}$/
If you don’t require the full number of digits in each part (to
allow leading zeros in values like 03
to be missing, for example), just look
for three nonempty digit sequences:
/^d+Dd+Dd+$/
Of course, that pattern is so general that it will also match other values such as U.S. Social Security numbers (which have the format 012-34-5678). To constrain the subpart lengths by requiring two to four digits in the year part and one or two digits in the month and day parts, use this pattern:
/^d{2,4}?Dd{1,2}Dd{1,2}$/
For dates in other formats such as
MM-DD-YY
or
DD-MM-YY
, similar patterns apply, but the
subparts are arranged in a different order. This pattern matches both
of those formats:
/^d{2}-d{2}-d{2}$/
If you need to check the values of individual date parts, use parentheses in the pattern and extract the substrings after a successful match. If you’re expecting dates to be in ISO format, for example, do something like this:
if ($val =~ /^(d{2,4})D(d{1,2})D(d{1,2})$/) { ($year, $month, $day) = ($1, $2, $3); }
The library file lib/Cookbook_Utils.pm in the recipes
distribution contains several of
these pattern tests, packaged as function calls. If the date doesn’t
match the pattern, they return undef
. Otherwise, they return a reference to
an array containing the broken-out values for the year, month, and
day. This can be useful for performing further checking on the
components of the date. For example, is_iso_date()
looks for dates that
match ISO format. It’s defined as follows:
sub is_iso_date { my $s = $_[0]; return undef unless $s =~ /^(d{2,4})D(d{1,2})D(d{1,2})$/; return [ $1, $2, $3 ]; # return year, month, day }
To use the function, do something like this:
my $ref = is_iso_date ($val); if (defined ($ref)) { # $val matched ISO format pattern; # check its subparts using $ref->[0] through $ref->[2] } else { # $val didn't match ISO format pattern }
You’ll often find additional processing necessary with dates, because although date-matching patterns help to weed out values that are syntactically malformed, they don’t assess whether the individual components contain legal values. To do that, some range checking is necessary. That topic is covered later in Performing Validity Checking on Date or Time Subparts.
If you’re willing to skip subpart testing and just want to
rewrite the pieces, you can use a substitution. For example, to
rewrite values assumed to be in MM-DD-YY
format into YY-MM-DD
format, do
this:
$val =~ s/^(d+)D(d+)D(d+)$/$3-$1-$2/;
Time values are somewhat more orderly than dates, usually being written with hours first and seconds last, with two digits per part:
/^d{2}:d{2}:d{2}$/
To be more lenient, you can allow the hours part to have a single digit, or the seconds part to be missing:
/^d{1,2}:d{2}(:d{2})?$/
You can mark parts of the time with parentheses if you want to
range-check the individual parts, or perhaps to reformat the value to
include a seconds part of 00
if it
happens to be missing. However, this requires some care with the
parentheses and the ?
characters in
the pattern if the seconds part is optional. You want to allow the
entire :d{2}
at the end of the
pattern to be optional, but not to save the :
character in $3
if the third time section is present. To
accomplish that, use (?:
pat
)
, an alternative grouping notation that
doesn’t save the matched substring. Within that notation, use
parentheses around the digits to save them. Then $3
will be undef
if the seconds part is not present,
but will contain the seconds digits otherwise:
if ($val =~ /^(d{1,2}):(d{2})(?::(d{2}))?$/) { my ($hour, $min, $sec) = ($1, $2, $3); $sec = "00" if !defined ($sec); # seconds missing; use 00 $val = "$hour:$min:$sec"; }
To rewrite times from 12-hour format with AM and PM suffixes to 24-hour format, you can do something like this:
if ($val =~ /^(d{1,2}):(d{2})(?::(d{2}))?s*(AM|PM)?$/i) { my ($hour, $min, $sec) = ($1, $2, $3); # supply missing seconds $sec = "00" unless defined ($sec); if ($hour == 12 && (!defined ($4) || uc ($4) eq "AM")) { $hour = "00"; # 12:xx:xx AM times are 00:xx:xx } elsif ($hour < 12 && defined ($4) && uc ($4) eq "PM") { $hour += 12; # PM times other than 12:xx:xx } $val = "$hour:$min:$sec"; }
The time parts are placed into $1
, $2
,
and $3
, with $3
set to undef
if the seconds part is missing. The
suffix goes into $4
if it’s
present. If the suffix is AM
or
missing (undef
), the value is
interpreted as an AM time. If the suffix is PM
, the value is interpreted as a PM
time.
This section is just the beginning of what you can do when processing dates for data-transfer purposes. Date and time testing and conversion can be highly idiosyncratic, and the sheer number of issues to consider is mind-boggling:
What is the basic date format? Dates come in several common
styles, such as ISO (CCYY-MM-DD
), U.S.
(MM-DD-YY
), and British
(DD-MM-YY
) formats. And these are just
some of the more standard formats. Many more are possible. For
example, a datafile may contain dates written as June
17,
1959
or as 17
Jun
'59
.
Are trailing times allowed on dates or perhaps required? When times are expected, is the full time required or just the hour and minute?
Do you allow special values like now
or today
?
Are date parts required to be delimited by a certain
character, such as -
or
/
, or are other delimiters
allowed?
Are date parts required to have a specific number of digits? Or are leading zeros on month and year values allowed to be missing?
Are months written numerically, or are they represented as
month names like January
or
Jan
?
Are two-digit year values allowed? Should they be converted
to have four digits? If so, what is the conversion rule? (What is
the transition point within the range 00
to 99
at which values change from one
century to another?)
Should date parts be checked to ensure their validity?
Patterns can recognize strings that look like dates or times, but
while they’re extremely useful for detecting malformed values,
they may not be sufficient. A value like 1947-15-99
may match a pattern but isn’t
a legal date. Pattern testing is thus most useful in conjunction
with range checks on the individual parts of the date.
The prevalence of these issues in data-transfer problems means that you’ll probably end up writing some of your own validators on occasion to handle very specific date formats. Later sections of this chapter can provide additional assistance. For example, Converting Two-Digit Year Values to Four-Digit Form covers conversion of two-digit year values to four-digit form, and Performing Validity Checking on Date or Time Subparts discusses how to perform validity checking on components of date or time values.
3.142.199.181