regex - Trying to match a partial file name plus additional text in the text of a file
Hi, I'm trying to match part of a file's name, plus some additional text, against the text inside that file.
Basically, I've got files named like this:
pieceiwanttomatch_don't_care_about_this.txt
and I'm trying to match the first 7 letters of the file name plus another string in the file, but I'm not having any luck.
Here's what I have so far:
use strict;
use warnings;
use File::Path qw(make_path remove_tree);

my $calls_dir = "ask/parsed/html/";
opendir(my $search_dir, $calls_dir) or die "$!\n";
my @files = grep /\.txt$/i, readdir $search_dir;
closedir $search_dir;
#print "got ", scalar @files, " files\n";
#my %seen = ();

foreach my $file (@files) {
    my %seen = ();
    my $current_file = $calls_dir . $file;
    open my $FILE, '<', $current_file or die "$file: $!\n";
    while (<$FILE>) {
        #if (/phone/i) {
        chomp;
        #if (/phone\s*(.*)\r?$/i) {
        #if (/^phone\s*:\s*(.*)\r?$/i) {
        #if (/contact\s*(.*)\r?$/i) {
        #if (/^*(.*)team\s*(.*)\r?$/i) {
        print substr(${file}, 0, 7);
        if (/^(?=.* 'substr(${file}, 0, 7)')(?=.*management)/s) {
            $seen{$1} = 1;
            #print $file."\t"."$_\n";
            #open my $fh, '>', "ask/parsed/html2/"."${file}.parsed_for_contact_us.txt" or die $!;
            make_path('ask/parsed/html2/');
            open my $fh, '>', "ask/parsed/html2/" . "${file}.parsed_for_management.txt" or die $!;
            #open my $fh, '>', "$_"."result".".txt" or die $!;
            #$fh->print("$file\t$_\n");
            $fh->print("$_\n");
            print "$_\n";
            #print "\t";
            print "\n";
            print "\t";
            #print "$_\n";
            #print "\t";
            #print "\n";
            foreach my $addr (sort keys %seen) {
            }
        }
    }
    close $FILE;
}
Here's an example for people to look at.
I think this is an example of what I'm trying to do: given a file named nintendo_ask_parse.html, I'm trying to use the string nintendo from the file name and another string, game, to find each matching line in the file and print it to a file.
Added 11-12-2014: here's more data, as requested by the few people who have kindly been helping me so far. I'm running a first script that I wrote to pull URLs into files. Here's that script:
use strict;
use warnings;
use LWP::Simple;

my $link1 = "http://www.ask.com/web?q=";
my $link2 = "+video+game&qsrc=0&o=0&l=dir&qo=homepagesearchbox";
#my $link3 = "http://www.";
#my $link4 = "http://www.manta.com/search? search_source=nav&pt=&search_location=burlingame+ca&search=";

open (my $fh2, "untitled.txt") or die "could not open file";
while (my $row = <$fh2>) {
    chomp $row;
    print "$row\n";
    my $xml1 = $link1 . $row. $link2 ;
    #my $xmla = $link3 . $row . ".com";
    #my $xmlx = $link4 . $row;
    mkdir 'ask', 0755;
    my $filename1 = "ask/".($row)."_"."ask".".html";
    open my $fh1, ">", $filename1 or die("could not open file. $!");
    print $row;
    my $xml2 = get $xml1;    # fetch the page with LWP::Simple
    print $xml1;
    print "\n";
    print $fh1 $xml2;
}
=============================================================================
After that script runs, I get HTML files based on the number of entries in the untitled.txt file, one per entry.
I have 4 example files, named activision_ask.html, apple_ask.html, atari_ask.html and nintendo_ask.html, from running the script above. Here are the contents of one file, activision_ask.html:
answers q&a community advanced search images news first video game invented video game design wii video game designer career video game companies spider-man 3 video game video game walkthroughs video game statistics call of duty 4 more answers amazon.com results activision source activision publishing, inc. american video game publisher. founded on october 1, 1979 , world's first independent developer , distributor of video games gaming consoles. first products cartridges atari 2600 video console system published july 1980 market , august 1981 international market (uk). activision 1 of largest video game publishers in world , top publisher 2... read more » go to: ask encyclopedia · images · videos browse article: history · studios · notable games published · upcoming games · references · source: wikipedia related questions: • video game publisher of loom? • developing games activision , have done in past? hear handheld versions of game different console versions. care enlighten us? • game created "activision" "atari 2600". 4 players play @ 1 time. 1 it? view more q&a » www.giantbomb.com/activision/3010-78/ oct 9, 2014 ... activision largest third-party publisher in world. became first third- party developer video game consoles, , responsible ... explore more answers source: www.kgbanswers.com · privacy · terms · careers · ask blog · q&a · mobile · · feedback © 2014 ask.com **truncated
=============================================================================
There's a second script that pulls all of the links out of the HTML file above and puts them in a file. Here's that script:
=============================================================================
use lib '/users/lialin/perl5/lib/perl5';
use strict;
use warnings;
use feature 'say';
use File::Slurp 'slurp';    # makes it easy to read files.
use Mojo;
use Mojo::UserAgent;
use URI;
use File::Path qw(make_path remove_tree);

#my $html_file = shift @ARGV;    # take file from command line

my $calls_dir = "ask/";
opendir(my $search_dir, $calls_dir) or die "$!\n";
my @html_files = grep /\.html$/i, readdir $search_dir;
closedir $search_dir;
#print "got ", scalar @files, " files\n";
#my %seen = ();

foreach my $html_files (@html_files) {
    my %seen = ();
    my $current_file = $calls_dir . $html_files;
    open my $FILE, '<', $current_file or die "$html_files: $!\n";
    my $dom = Mojo::DOM->new(scalar slurp $calls_dir .$html_files);
    print $calls_dir .$html_files ;
    #for my $csshref ($dom->find('a[href]')->attr('href')->each) {
    #for my $link ($dom->find('a[href]')->attr('href')->each) {
    # print $1;
    #say $1 #if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
    make_path('ask/parsed/html/');
    open my $fh, '>', "ask/parsed/html/${html_files}.result.txt" or die $!;
    for my $csshref ($dom->find('a[href]')->attr('href')->each) {
        my $cssurl = URI->new($csshref)->abs($calls_dir .$html_files);
        #open my $fh, '>', "ask/${html_files}.result.txt" or die $!;
        $fh->print("$html_files\n");
        $fh->print("$cssurl\n");
        #$fh->print("\t"."$_\n");
        #print "$cssurl\n";
        #print $file."\t"."$_\n";
    }
}
====================================================
The resulting files look like this (using the activision example again):
=============================================================================
activision_ask.html http://www.ask.com/answers/browse? qsrc=167&q=activision+video+game&qo=channelnavigation&o=0&l=dir activision_ask.html http://www.ask.com/answers/browse?qsrc=167&q=activision+video+game&o=0&l=dir#opensignin activision_ask.html http://www.ask.com/answers/profile?qsrc=3099 activision_ask.html http://www.ask.com/answers/profile?qsrc=3099 activision_ask.html javascript:void(0); activision_ask.html http://www.ask.com/advancedsearch? qsrc=167&q=activision+video+game&qo=channelnavigation&o=0&l=dir activision_ask.html http://www.ask.com/?o=0&l=dir&qsrc=14137 activision_ask.html http://www.ask.com/pictures?q=activision+video+game&qsrc=167&qo=channelnavigation&o=0&l=dir activision_ask.html http://www.ask.com/news?q=activision+video+game&qsrc=167&qo=channelnavigation&o=0&l=dir activision_ask.html http://www.ask.com/youtube?q=activision+video+game&qsrc=167&qo=channelnavigation&o=0&l=dir activision_ask.html http://www.ask.com/shopping?q=activision+video+game&qsrc=167&qo=channelnavigation&o=0&l=dir activision_ask.html javascript:void(0); activision_ask.html http://www.ask.com/maps?q=activision+video+game&qsrc=167&qo=channelnavigation&o=0&l=dir activision_ask.html javascript:void(0); activision_ask.html http://www.ask.com/web?q=video+game+cheats&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=video+game+tester&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=create+your+own+video+games&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=first+video+game+invented&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=video+game+design&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=wii&qsrc=466&o=0&l=dir&qo=relatedsearchexpand activision_ask.html http://www.ask.com/web?q=video+game+designer+career&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html 
http://www.ask.com/web?q=video+game+companies&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=spider-man+3+video+game&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=video+game+walkthroughs&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=video+game+statistics&qsrc=466&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/web?q=call+of+duty+4&qsrc=466&o=0&l=dir&qo=relatedsearchexpand activision_ask.html http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3daps&field- keywords=activision&x=0&y=0&tag=askcom05-20 activision_ask.html http://www.amazon.com/activision-anthology-playstation- 2/dp/b00006z7hq%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db00006z7hq activision_ask.html http://www.amazon.com/activision-anthology-playstation-2/dp/b00006z7hq%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db00006z7hq activision_ask.html http://www.amazon.com/destiny-xbox-360/dp/b002i096q4%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db002i096q4 activision_ask.html http://www.amazon.com/destiny-xbox-360/dp/b002i096q4%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db002i096q4 activision_ask.html http://www.amazon.com/skylanders-trap-team-not-machine-specific/dp/b00nca6zt0%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db00nca6zt0 activision_ask.html 
http://www.amazon.com/skylanders-trap-team-not-machine-specific/dp/b00nca6zt0%3fpsc%3d1%26subscriptionid%3d06kmpshedsxxqmqvt482%26tag%3daskcom05-20%26linkcode%3dxm2%26camp%3d2025%26creative%3d165953%26creativeasin%3db00nca6zt0 activision_ask.html http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3daps&field-keywords=activision&x=0&y=0&tag=askcom05-20 activision_ask.html http://www.ask.com/wiki/activision activision_ask.html http://www.ask.com/wiki/activision activision_ask.html http://en.wikipedia.org/wiki/file:activision.svg activision_ask.html http://www.ask.com/allabout?q=video%20game%20publisher&qsrc=470 activision_ask.html http://www.ask.com/allabout?q=video%20game%20console&qsrc=470 activision_ask.html http://www.ask.com/allabout?q=atari%202600&qsrc=470 activision_ask.html http://www.ask.com/wiki/activision activision_ask.html http://www.ask.com/wiki/activision#upcoming_games activision_ask.html http://www.ask.com/wiki/activision#references activision_ask.html http://en.wikipedia.org/wiki/activision activision_ask.html http://www.ask.com/web?q=who+was+the+video+game+publisher+of+loom%3f&qsrc=469&o=0&l=dir&qo=relatedquestions activision_ask.html http://www.ask.com/web?q=activision+video+game&qsrc=3060&o=0&l=dir activision_ask.html http://www.activision.com/ activision_ask.html http://www.activision.com/games activision_ask.html http://clk.about.com?zi=13/1to&ity=boostorg&o=0&ldid=4451&eng=boost&zu=http://vgstrategies.about.com/od/gameboycheatscodes/a/activision-anthology.htm http://www.gametrailers.com/company/pou3yf/activision activision_ask.html http://www.cnbc.com/id/102026893 activision_ask.html http://www.giantbomb.com/activision/3010-78/ activision_ask.html http://www.ask.com/web?q=history+of+video+game+systems&qsrc=467&o=0&l=dir&qo=relatedsearchnarrow activision_ask.html http://www.ask.com/mobile?&o=0&l=dir&qsrc=0 activision_ask.html http://help.ask.com activision_ask.html http://feedback.ask.com
=============================================================================
I'm working on a final script that uses part of the file's name plus a string to read out the line, or multiple lines, of the file that contain matching or close-matching text.
In the above example I'm interested in 'http://www.activision.com/games', or any URL that has the word 'activision' from the file name and the word 'game' in it.
My file names vary in size, and the word game may come before or after the file name.
I hope this explanation and code helps others understand what I'm trying to accomplish.
The problem I have right now is the regex for searching for both strings. I'm working on making it less strict, and I can't get the matching to work properly.
As mentioned before, I'm pretty well versed in HTML and Java, and I know Perl is the right language for this, but I'm not an expert in it (as you can tell if you look at the code above). I'm trying to learn and complete this task.
I'm not clear what you want to do, but given your example file name
pieceiwanttomatch_don't_care_about_this.txt
I suppose you want to find all files whose names start with its first 7 characters, pieceiw, and end with .txt. For that you would write
if ( /^pieceiw.*\.txt$/ ) { ... }
I hope this helps
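As a minimal sketch of that idea, here is the same test applied to a list of names. The sub name names_with_prefix and the sample file names are made up for illustration, and the prefix is wrapped in \Q...\E so that it is treated as literal text rather than as part of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names that start with $prefix and end in .txt.
sub names_with_prefix {
    my ($prefix, @names) = @_;
    return grep { /^\Q$prefix\E.*\.txt$/ } @names;
}

# Sample names invented for this example.
my @names = (
    "pieceiwanttomatch_don't_care_about_this.txt",
    'pieceiwanttomatch_other.html',
    'somethingelse_whatever.txt',
);

# Only the first name both starts with the prefix and ends in .txt.
print "$_\n" for names_with_prefix('pieceiw', @names);
```

The second name is rejected because of its extension and the third because of its prefix.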
Update
Okay, I think you want to search all .txt files in a directory for lines that contain both the first N characters of the file name and some other specified string.
If you don't know which will appear first (the file name prefix or the other string) then you are along the right lines with a double look-ahead. One refinement is to enclose the strings in \Q...\E, which escapes any non-word characters to prevent regex metacharacters from messing up the pattern.
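A minimal sketch of the double look-ahead with \Q...\E, assuming a hypothetical helper called has_both; the second URL in the list is invented to show the reversed order.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains both strings, in either order.
sub has_both {
    my ($line, $prefix, $wanted) = @_;
    # Each look-ahead scans from the start of the line independently,
    # so the order of the two strings doesn't matter.
    return $line =~ /^(?=.*\Q$prefix\E)(?=.*\Q$wanted\E)/ ? 1 : 0;
}

for my $line (
    'http://www.activision.com/games',    # prefix first, then "game"
    'http://games.example/activision',    # "game" first (made-up URL)
    'http://www.activision.com/',         # no "game" at all
) {
    printf "%-35s %s\n", $line,
        has_both($line, 'activis', 'game') ? 'match' : 'no match';
}
```

Without \Q...\E a prefix such as a.b would also match axb, because the dot would be read as "any character".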
Note the following
I've used autodie, as explained in my answer to your previous question. If you're running a version of Perl earlier than v5.10 and can't upgrade, then you won't be able to use it and will have to check the status of each file operation separately
It's important to use absolute paths for the directories; otherwise the user has to make sure they have the correct current working directory before running the program
I've put the parameters for the program (the two directories and the additional string to be searched for) as definitions at the top of the program
I've used glob instead of opendir / readdir / grep because it's tidier, and the file names it returns include the full path
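To illustrate the points about autodie and glob, here is a small sketch that builds a throwaway directory and lists the prefix of each .txt file in it. The helper name prefixes_in and the file names are made up, and it assumes the temporary directory path contains no spaces, since glob splits its pattern on whitespace.

```perl
use strict;
use warnings;
use autodie;                          # open/close die with a message on failure
use File::Basename qw/ fileparse /;
use File::Temp qw/ tempdir /;

# Hypothetical helper: map each *.txt file in $dir to the first $n
# characters of its base name.
sub prefixes_in {
    my ($dir, $n) = @_;
    my %prefix_for;
    for my $path (glob "$dir/*.txt") {      # glob returns the path including $dir
        my $basename = fileparse($path);    # strip the directory part
        $prefix_for{$basename} = substr $basename, 0, $n;
    }
    return \%prefix_for;
}

# Throwaway directory and file names, invented for this example.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw/ nintendo_ask.html.result.txt atari_ask.html.result.txt notes.md /) {
    open my $fh, '>', "$dir/$name";   # no "or die" needed under autodie
    close $fh;
}

my $prefixes = prefixes_in($dir, 7);
print "$_ => $prefixes->{$_}\n" for sort keys %$prefixes;
```

Because glob hands back the directory as part of each name, the prefix has to be taken from the base name rather than from the full path.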
use strict;
use warnings;
use 5.010;
use autodie;

use File::Path qw/ make_path remove_tree /;
use File::Basename qw/ fileparse /;

my $calls_dir  = '/path/to/ask/parsed/html';
my $parsed_dir = '/path/to/ask/parsed/html2';
my $wanted     = 'game';

my @files = glob "$calls_dir/*.txt";
printf "got %d files\n", scalar @files;

for my $file (@files) {

    open my $in_fh, '<', $file;

    # glob returns full paths, so take the prefix from the base name
    my $basename = fileparse($file);
    my $prefix   = substr $basename, 0, 8;
    print $prefix, "\n";

    make_path($parsed_dir);
    open my $out_fh, '>', "$parsed_dir/${basename}_parsed_for_management.txt";

    while (<$in_fh>) {
        # \Q...\E makes both strings literal; /x ignores the spaces in the pattern
        print $out_fh $_ if / \Q$prefix\E .* \Q$wanted\E /x;
    }

    close $out_fh;
}
update
this works fine
my ($wanted, $prefix) = qw/ game nintendo /;

for ( 'game.nintendo.com/phoenix.zhtml?c=121127&p=irol-gom' ) {
    print "ok\n" if / \Q$wanted\E .* \Q$prefix\E /x;
}
output
ok