Changeset 1710 in subversion
- Timestamp: Aug 30, 2008 7:24:39 AM
- Location: trunk/roundcubemail
- Files: 2 edited
  - CHANGELOG (modified) (1 diff)
  - program/lib/html2text.php (modified) (15 diffs)
trunk/roundcubemail/CHANGELOG
r1709 r1710
  CHANGELOG RoundCube Webmail
  ---------------------------

+ 2008/08/30 (alec)
+ ----------
+ - Improved HTML to TXT conversion by html2text class update
+   to version 1.0.0
+
  2008/08/28 (alec)
trunk/roundcubemail/program/lib/html2text.php
r1696 r1710 2 2 3 3 /************************************************************************* 4 * * 5 * class.html2text.inc * 6 * * 7 ************************************************************************* 8 * * 9 * Converts HTML to formatted plain text * 10 * * 11 * Copyright (c) 2005 Jon Abernathy <jon@chuggnutt.com> * 12 * All rights reserved. * 13 * * 14 * This script is free software; you can redistribute it and/or modify * 15 * it under the terms of the GNU General Public License as published by * 16 * the Free Software Foundation; either version 2 of the License, or * 17 * (at your option) any later version. * 18 * * 19 * The GNU General Public License can be found at * 20 * http://www.gnu.org/copyleft/gpl.html. * 21 * * 22 * This script is distributed in the hope that it will be useful, * 23 * but WITHOUT ANY WARRANTY; without even the implied warranty of * 24 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * 25 * GNU General Public License for more details. * 26 * * 27 * Author(s): Jon Abernathy <jon@chuggnutt.com> * 28 * * 29 * Last modified: 04/06/05 * 30 * Modified: 2004/05/19 (tbr) * 31 * * 32 *************************************************************************/ 33 34 /* 2008/08/29: Added PRE handling by A.L.E.C <alec@alec.pl> */ 4 * * 5 * class.html2text.inc * 6 * * 7 ************************************************************************* 8 * * 9 * Converts HTML to formatted plain text * 10 * * 11 * Copyright (c) 2005-2007 Jon Abernathy <jon@chuggnutt.com> * 12 * All rights reserved. * 13 * * 14 * This script is free software; you can redistribute it and/or modify * 15 * it under the terms of the GNU General Public License as published by * 16 * the Free Software Foundation; either version 2 of the License, or * 17 * (at your option) any later version. * 18 * * 19 * The GNU General Public License can be found at * 20 * http://www.gnu.org/copyleft/gpl.html. 
* 21 * * 22 * This script is distributed in the hope that it will be useful, * 23 * but WITHOUT ANY WARRANTY; without even the implied warranty of * 24 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * 25 * GNU General Public License for more details. * 26 * * 27 * Author(s): Jon Abernathy <jon@chuggnutt.com> * 28 * * 29 * Last modified: 08/08/07 * 30 * * 31 *************************************************************************/ 32 35 33 36 34 /** 37 * Takes HTML and converts it to formatted, plain text. 38 * 39 * Thanks to Alexander Krug (http://www.krugar.de/) to pointing out and 40 * correcting an error in the regexp search array. Fixed 7/30/03. 41 * 42 * Updated set_html() function's file reading mechanism, 9/25/03. 43 * 44 * Thanks to Joss Sanglier (http://www.dancingbear.co.uk/) for adding 45 * several more HTML entity codes to the $search and $replace arrays. 46 * Updated 11/7/03. 47 * 48 * Thanks to Darius Kasperavicius (http://www.dar.dar.lt/) for 49 * suggesting the addition of $allowed_tags and its supporting function 50 * (which I slightly modified). Updated 3/12/04. 51 * 52 * Thanks to Justin Dearing for pointing out that a replacement for the 53 * <TH> tag was missing, and suggesting an appropriate fix. 54 * Updated 8/25/04. 55 * 56 * Thanks to Mathieu Collas (http://www.myefarm.com/) for finding a 57 * display/formatting bug in the _build_link_list() function: email 58 * readers would show the left bracket and number ("[1") as part of the 59 * rendered email address. 60 * Updated 12/16/04. 61 * 62 * Thanks to Wojciech Bajon (http://histeria.pl/) for submitting code 63 * to handle relative links, which I hadn't considered. I modified his 64 * code a bit to handle normal HTTP links and MAILTO links. Also for 65 * suggesting three additional HTML entity codes to search for. 66 * Updated 03/02/05. 67 * 68 * Thanks to Jacob Chandler for pointing out another link condition 69 * for the _build_link_list() function: "https". 
70 * Updated 04/06/05. 71 * 72 * @author Jon Abernathy <jon@chuggnutt.com> 73 * @version 0.6.1 74 * @since PHP 4.0.2 75 */ 35 * Takes HTML and converts it to formatted, plain text. 36 * 37 * Thanks to Alexander Krug (http://www.krugar.de/) to pointing out and 38 * correcting an error in the regexp search array. Fixed 7/30/03. 39 * 40 * Updated set_html() function's file reading mechanism, 9/25/03. 41 * 42 * Thanks to Joss Sanglier (http://www.dancingbear.co.uk/) for adding 43 * several more HTML entity codes to the $search and $replace arrays. 44 * Updated 11/7/03. 45 * 46 * Thanks to Darius Kasperavicius (http://www.dar.dar.lt/) for 47 * suggesting the addition of $allowed_tags and its supporting function 48 * (which I slightly modified). Updated 3/12/04. 49 * 50 * Thanks to Justin Dearing for pointing out that a replacement for the 51 * <TH> tag was missing, and suggesting an appropriate fix. 52 * Updated 8/25/04. 53 * 54 * Thanks to Mathieu Collas (http://www.myefarm.com/) for finding a 55 * display/formatting bug in the _build_link_list() function: email 56 * readers would show the left bracket and number ("[1") as part of the 57 * rendered email address. 58 * Updated 12/16/04. 59 * 60 * Thanks to Wojciech Bajon (http://histeria.pl/) for submitting code 61 * to handle relative links, which I hadn't considered. I modified his 62 * code a bit to handle normal HTTP links and MAILTO links. Also for 63 * suggesting three additional HTML entity codes to search for. 64 * Updated 03/02/05. 65 * 66 * Thanks to Jacob Chandler for pointing out another link condition 67 * for the _build_link_list() function: "https". 68 * Updated 04/06/05. 69 * 70 * Thanks to Marc Bertrand (http://www.dresdensky.com/) for 71 * suggesting a revision to the word wrapping functionality; if you 72 * specify a $width of 0 or less, word wrapping will be ignored. 73 * Updated 11/02/06. 
74 * 75 * *** Big housecleaning updates below: 76 * 77 * Thanks to Colin Brown (http://www.sparkdriver.co.uk/) for 78 * suggesting the fix to handle </li> and blank lines (whitespace). 79 * Christian Basedau (http://www.movetheweb.de/) also suggested the 80 * blank lines fix. 81 * 82 * Special thanks to Marcus Bointon (http://www.synchromedia.co.uk/), 83 * Christian Basedau, Norbert Laposa (http://ln5.co.uk/), 84 * Bas van de Weijer, and Marijn van Butselaar 85 * for pointing out my glaring error in the <th> handling. Marcus also 86 * supplied a host of fixes. 87 * 88 * Thanks to Jeffrey Silverman (http://www.newtnotes.com/) for pointing 89 * out that extra spaces should be compressed--a problem addressed with 90 * Marcus Bointon's fixes but that I had not yet incorporated. 91 * 92 * Thanks to Daniel Schledermann (http://www.typoconsult.dk/) for 93 * suggesting a valuable fix with <a> tag handling. 94 * 95 * Thanks to Wojciech Bajon (again!) for suggesting fixes and additions, 96 * including the <a> tag handling that Daniel Schledermann pointed 97 * out but that I had not yet incorporated. I haven't (yet) 98 * incorporated all of Wojciech's changes, though I may at some 99 * future time. 100 * 101 * *** End of the housecleaning updates. Updated 08/08/07. 102 * 103 * @author Jon Abernathy <jon@chuggnutt.com> 104 * @version 1.0.0 105 * @since PHP 4.0.2 106 */ 76 107 class html2text 77 108 { … … 95 126 /** 96 127 * Maximum width of the formatted text, in columns. 128 * 129 * Set this value to 0 (or less) to ignore word wrapping 130 * and not constrain text to a fixed-width column. 
97 131 * 98 132 * @var integer $width … … 112 146 "/\r/", // Non-legal carriage return 113 147 "/[\n\t]+/", // Newlines and tabs 148 '/[ ]{2,}/', // Runs of spaces, pre-handling 114 149 '/<script[^>]*>.*?<\/script>/i', // <script>s -- which strip_tags supposedly has problems with 150 '/<style[^>]*>.*?<\/style>/i', // <style>s -- which strip_tags supposedly has problems with 115 151 //'/<!-- .* -->/', // Comments -- which strip_tags might have problem a with 116 '/<a [^>]*href=("|\')([^"\']+)\1[^>]*>(.+?)<\/a>/ie', // <a href=""> 117 '/<h[123][^>]*>(.+?)<\/h[123]>/ie', // H1 - H3 118 '/<h[456][^>]*>(.+?)<\/h[456]>/ie', // H4 - H6 152 '/<h[123][^>]*>(.*?)<\/h[123]>/ie', // H1 - H3 153 '/<h[456][^>]*>(.*?)<\/h[456]>/ie', // H4 - H6 119 154 '/<p[^>]*>/i', // <P> 120 155 '/<br[^>]*>/i', // <br> 121 '/<b[^>]*>(.+?)<\/b>/ie', // <b> 122 '/<i[^>]*>(.+?)<\/i>/i', // <i> 156 '/<b[^>]*>(.*?)<\/b>/ie', // <b> 157 '/<strong[^>]*>(.*?)<\/strong>/ie', // <strong> 158 '/<i[^>]*>(.*?)<\/i>/i', // <i> 159 '/<em[^>]*>(.*?)<\/em>/i', // <em> 123 160 '/(<ul[^>]*>|<\/ul>)/i', // <ul> and </ul> 124 161 '/(<ol[^>]*>|<\/ol>)/i', // <ol> and </ol> 162 '/<li[^>]*>(.*?)<\/li>/i', // <li> and </li> 125 163 '/<li[^>]*>/i', // <li> 164 '/<a [^>]*href=("|\')([^"\']+)\1[^>]*>(.*?)<\/a>/ie', 165 // <a href=""> 126 166 '/<hr[^>]*>/i', // <hr> 127 167 '/(<table[^>]*>|<\/table>)/i', // <table> and </table> 128 168 '/(<tr[^>]*>|<\/tr>)/i', // <tr> and </tr> 129 '/<td[^>]*>(.+?)<\/td>/i', // <td> and </td> 130 '/<th[^>]*>(.+?)<\/th>/ie', // <th> and </th> 131 '/ /i', 132 '/"/i', 133 '/>/i', 134 '/</i', 135 '/&(amp|#38);/i', 136 '/©/i', 137 '/™/i', 138 '/“/', 139 '/”/', 140 '/–/', 141 '/&#(8217|39);/', 142 '/©/', 143 '/™/', 144 '/—/', 145 '/“/', 146 '/”/', 147 '/•/', 148 '/®/i', 149 '/•/i', 150 '/&[&;]+;/i' 169 '/<td[^>]*>(.*?)<\/td>/i', // <td> and </td> 170 '/<th[^>]*>(.*?)<\/th>/ie', // <th> and </th> 171 '/&(nbsp|#160);/i', // Non-breaking space 172 '/&(quot|rdquo|ldquo|#8220|#8221|#147|#148);/i', 
173 // Double quotes 174 '/&(apos|rsquo|lsquo|#8216|#8217);/i', // Single quotes 175 '/>/i', // Greater-than 176 '/</i', // Less-than 177 '/&(amp|#38);/i', // Ampersand 178 '/&(copy|#169);/i', // Copyright 179 '/&(trade|#8482|#153);/i', // Trademark 180 '/&(reg|#174);/i', // Registered 181 '/&(mdash|#151|#8212);/i', // mdash 182 '/&(ndash|minus|#8211|#8722);/i', // ndash 183 '/&(bull|#149|#8226);/i', // Bullet 184 '/&(pound|#163);/i', // Pound sign 185 '/&(euro|#8364);/i', // Euro sign 186 '/&[^&;]+;/i', // Unknown/unhandled entities 187 '/[ ]{2,}/' // Runs of spaces, post-handling 151 188 ); 152 189 … … 161 198 '', // Non-legal carriage return 162 199 ' ', // Newlines and tabs 200 ' ', // Runs of spaces, pre-handling 163 201 '', // <script>s -- which strip_tags supposedly has problems with 164 //'', // Comments -- which strip_tags might have problem awith165 '$this->_build_link_list("\\2", "\\3")', // <a href="">202 '', // <style>s -- which strip_tags supposedly has problems with 203 //'', // Comments -- which strip_tags might have problem a with 166 204 "strtoupper(\"\n\n\\1\n\n\")", // H1 - H3 167 "ucwords(\"\n\n\\1\n\")", // H4 - H6168 "\n\n", // <P>205 "ucwords(\"\n\n\\1\n\")", // H4 - H6 206 "\n\n", // <P> 169 207 "\n", // <br> 170 208 'strtoupper("\\1")', // <b> 209 'strtoupper("\\1")', // <strong> 171 210 '_\\1_', // <i> 211 '_\\1_', // <em> 172 212 "\n\n", // <ul> and </ul> 173 213 "\n\n", // <ol> and </ol> 174 "\t*", // <li> 175 "\n-------------------------\n", // <hr> 176 "\n\n", // <table> and </table> 214 "\t* \\1\n", // <li> and </li> 215 "\n\t* ", // <li> 216 '$this->_build_link_list("\\2", "\\3")', 217 // <a href=""> 218 "\n-------------------------\n", // <hr> 219 "\n\n", // <table> and </table> 177 220 "\n", // <tr> and </tr> 178 221 "\t\t\\1\n", // <td> and </td> 179 222 "strtoupper(\"\t\t\\1\n\")", // <th> and </th> 180 ' ', 181 '"', 223 ' ', // Non-breaking space 224 '"', // Double quotes 225 "'", // Single quotes 182 226 '>', 183 227 '<', … … 
185 229 '(c)', 186 230 '(tm)', 187 ' "',188 ' "',231 '(R)', 232 '--', 189 233 '-', 190 "'",191 '(c)',192 '(tm)',193 '--',194 '"',195 '"',196 234 '*', 197 '(R)', 198 '*', 199 '' 235 '£', 236 'EUR', // Euro sign. ? 237 '', // Unknown/unhandled entities 238 ' ' // Runs of spaces, post-handling 200 239 ); 201 240 202 /**203 * List of preg* regular expression patterns to search for in PRE body,204 * used in conjunction with $pre_replace.205 *206 * @var array $pre_search207 * @access public208 * @see $pre_replace209 */241 /** 242 * List of preg* regular expression patterns to search for in PRE body, 243 * used in conjunction with $pre_replace. 244 * 245 * @var array $pre_search 246 * @access public 247 * @see $pre_replace 248 */ 210 249 var $pre_search = array( 211 250 "/\n/", … … 251 290 * Indicates whether content in the $html variable has been converted yet. 252 291 * 253 * @var boolean $ converted292 * @var boolean $_converted 254 293 * @access private 255 294 * @see $html, $text … … 260 299 * Contains URL addresses from links to be rendered in plain text. 261 300 * 262 * @var string $ link_list301 * @var string $_link_list 263 302 * @access private 264 303 * @see _build_link_list() 265 304 */ 266 var $_link_list = array();305 var $_link_list = ''; 267 306 268 307 /** 269 * Boolean flag, true if a table of link URLs should be listed after the text. 270 * 271 * @var boolean $_do_links 272 * @access private 273 * @see html2text() 274 */ 275 var $_do_links = true; 308 * Number of valid links detected in the text, used for plain text 309 * display (rendered similar to footnotes). 
310 * 311 * @var integer $_link_count 312 * @access private 313 * @see _build_link_list() 314 */ 315 var $_link_count = 0; 276 316 277 317 /** … … 284 324 * @param string $source HTML content 285 325 * @param boolean $from_file Indicates $source is a file to pull content from 286 * @param boolean $do_link_table indicate whether a table of link URLs is desired 287 * @access public 288 * @return void 289 */ 290 function html2text( $source = '', $from_file = false, $produce_link_table = true ) 326 * @access public 327 * @return void 328 */ 329 function html2text( $source = '', $from_file = false ) 291 330 { 292 331 if ( !empty($source) ) { … … 294 333 } 295 334 $this->set_base_url(); 296 $this->_do_links = $produce_link_table;297 335 } 298 336 … … 308 346 { 309 347 if ( $from_file && file_exists($source) ) { 310 $this->html = file_get_contents($source); 311 } 312 else 313 $this->html = $source; 348 $this->html = file_get_contents($source); 349 } 350 else 351 $this->html = $source; 352 314 353 315 354 $this->_converted = false; … … 378 417 { 379 418 if ( empty($url) ) { 380 $this->url = 'http://' . $_SERVER['HTTP_HOST']; 419 if ( !empty($_SERVER['HTTP_HOST']) ) { 420 $this->url = 'http://' . 
$_SERVER['HTTP_HOST']; 421 } else { 422 $this->url = ''; 423 } 381 424 } else { 382 425 // Strip any trailing slashes for consistency (relative … … 403 446 { 404 447 // Variables used for building the link list 405 //$link_count = 1;406 //$this->_link_list = '';448 $this->_link_count = 0; 449 $this->_link_list = ''; 407 450 408 451 $text = trim(stripslashes($this->html)); 409 452 410 453 // Convert <PRE> 411 $this->_convert_pre($text);412 454 $this->_convert_pre($text); 455 413 456 // Run our defined search-and-replace 414 457 $text = preg_replace($this->search, $this->replace, $text); … … 418 461 419 462 // Bring down number of empty lines to 2 max 420 $text = preg_replace("/\n\s+\n/", "\n ", $text);463 $text = preg_replace("/\n\s+\n/", "\n\n", $text); 421 464 $text = preg_replace("/[\n]{3,}/", "\n\n", $text); 422 465 423 466 // Add link list 424 if ( sizeof($this->_link_list) ) { 425 $text .= "\n\nLinks:\n------\n"; 426 foreach ($this->_link_list as $id => $link) { 427 $text .= '[' . ($id+1) . '] ' . $link . "\n"; 428 } 467 if ( !empty($this->_link_list) ) { 468 $text .= "\n\nLinks:\n------\n" . $this->_link_list; 429 469 } 430 470 431 471 // Wrap the text to a readable format 432 472 // for PHP versions >= 4.0.2. Default width is 75 433 $text = wordwrap($text, $this->width); 473 // If width is 0 or less, don't wrap the text. 474 if ( $this->width > 0 ) { 475 $text = wordwrap($text, $this->width); 476 } 434 477 435 478 $this->text = $text; … … 446 489 * and relative links. 447 490 * 448 * @param integer $link_count Counter tracking current link number449 491 * @param string $link URL of the link 450 492 * @param string $display Part of the text to associate number with 451 493 * @access private 452 494 * @return string 453 */ 454 function _build_link_list($link, $display) 455 { 456 if (! 
$this->_do_links) return $display; 457 458 $link_lc = strtolower($link); 459 460 if (substr($link_lc, 0, 7) == 'http://' || substr($link_lc, 0, 8) == 'https://' || substr($link_lc, 0, 7) == 'mailto:') 461 { 462 $url = $link; 463 } 464 else 465 { 466 $url = $this->url; 467 if ($link{0} != '/') { 468 $url .= '/'; 495 */ 496 function _build_link_list( $link, $display ) 497 { 498 if ( substr($link, 0, 7) == 'http://' || substr($link, 0, 8) == 'https://' || 499 substr($link, 0, 7) == 'mailto:' ) { 500 $this->_link_count++; 501 $this->_link_list .= "[" . $this->_link_count . "] $link\n"; 502 $additional = ' [' . $this->_link_count . ']'; 503 } elseif ( substr($link, 0, 11) == 'javascript:' ) { 504 // Don't count the link; ignore it 505 $additional = ''; 506 // what about href="#anchor" ? 507 } else { 508 $this->_link_count++; 509 $this->_link_list .= "[" . $this->_link_count . "] " . $this->url; 510 if ( substr($link, 0, 1) != '/' ) { 511 $this->_link_list .= '/'; 469 512 } 470 $url .= $link; 471 } 472 473 $index = array_search($url, $this->_link_list); 474 if ($index===FALSE) 475 { 476 $index = sizeof($this->_link_list); 477 $this->_link_list[$index] = $url; 478 } 479 480 return $display . ' [' . ($index+1) . ']'; 481 } 482 513 $this->_link_list .= "$link\n"; 514 $additional = ' [' . $this->_link_count . ']'; 515 } 516 517 return $display . $additional; 518 } 519 483 520 /** 484 521 * Helper function for PRE body conversion. … … 486 523 * @param string HTML content 487 524 * @access private 488 */525 */ 489 526 function _convert_pre(&$text) 490 {491 while(preg_match('/<pre[^>]*>(.*)<\/pre>/ismU', $text, $matches))492 {493 $result = preg_replace($this->pre_search, $this->pre_replace, $matches[1]);494 $text = preg_replace('/<pre[^>]*>.*<\/pre>/ismU', '<div><br>' . $result . 
'<br></div>', $text);495 }496 }527 { 528 while(preg_match('/<pre[^>]*>(.*)<\/pre>/ismU', $text, $matches)) 529 { 530 $result = preg_replace($this->pre_search, $this->pre_replace, $matches[1]); 531 $text = preg_replace('/<pre[^>]*>.*<\/pre>/ismU', '<div><br>' . $result . '<br></div>', $text); 532 } 533 } 497 534 } 498 535
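The revised _build_link_list() in r1710 numbers links in order of appearance, appends each one to a string that is later printed under "Links:", and skips javascript: links entirely. The following is a minimal standalone sketch of that numbering, written for current PHP. The function name build_link_list_sketch and its by-reference parameters are hypothetical and are not part of html2text.php, where the counter and list live in $_link_count and $_link_list.

```php
<?php
// Hypothetical standalone sketch of the footnote-style link numbering that
// r1710 introduces in html2text::_build_link_list(). Not the class's code.
function build_link_list_sketch(string $link, string $display, string $baseUrl,
                                int &$linkCount, string &$linkList): string
{
    if (substr($link, 0, 7) === 'http://' || substr($link, 0, 8) === 'https://'
        || substr($link, 0, 7) === 'mailto:') {
        // Absolute and mailto links are listed unchanged.
        $linkCount++;
        $linkList .= '[' . $linkCount . '] ' . $link . "\n";
        return $display . ' [' . $linkCount . ']';
    }
    if (substr($link, 0, 11) === 'javascript:') {
        // Script links are neither counted nor listed.
        return $display;
    }
    // Anything else is treated as relative to the base URL.
    $linkCount++;
    $linkList .= '[' . $linkCount . '] ' . $baseUrl;
    if (substr($link, 0, 1) !== '/') {
        $linkList .= '/';
    }
    $linkList .= $link . "\n";
    return $display . ' [' . $linkCount . ']';
}

$count = 0;
$list = '';
$a = build_link_list_sketch('https://example.org/', 'Home', 'http://example.com', $count, $list);
$b = build_link_list_sketch('javascript:void(0)', 'Click', 'http://example.com', $count, $list);
$c = build_link_list_sketch('docs/faq.html', 'FAQ', 'http://example.com', $count, $list);
echo $a, "\n", $b, "\n", $c, "\n", $list;
```

Compared with the removed r1696 code, this version no longer lowercases the link before testing its scheme and no longer reuses an existing number when the same URL appears twice, so repeated links get separate entries.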
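The changeset also alters the final clean-up of the converted text: whitespace-only lines now collapse to a blank line ("\n\n" rather than the earlier replacement), and word wrapping is skipped when the width is 0 or less. A minimal sketch of those two steps, assuming current PHP, is shown here. The helper name finish_text_sketch is hypothetical; in the class these statements sit inside the conversion routine and read $this->width.

```php
<?php
// Hypothetical helper mirroring the post-processing steps shown in the
// r1710 diff: collapse blank lines, then wrap only when a width is set.
function finish_text_sketch(string $text, int $width): string
{
    // A run of whitespace-only lines becomes a single blank line.
    $text = preg_replace("/\n\s+\n/", "\n\n", $text);
    // Cap any remaining run of newlines at two.
    $text = preg_replace("/[\n]{3,}/", "\n\n", $text);

    // A width of 0 or less means "do not wrap".
    if ($width > 0) {
        $text = wordwrap($text, $width);
    }
    return $text;
}

echo finish_text_sketch("one two three", 5), "\n";
```

Passing 0 leaves long lines untouched, which suits callers that apply their own wrapping later.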
