
Author Topic: Removing HTML tags  (Read 2520 times)

Offline WibblyWib

  • Newbie Poster
  • *
  • Posts: 37
  • Make it so!
Removing HTML tags
« on: August 29, 2010, 06:09:05 AM »
So I've got a bit of my website where users submit comments, which are then displayed in a div. I want to remove ALL code which they might have placed in the submission, anything from a line break to embedded javascript. From a security perspective, would it be sufficient to simply have a php function which scans the string and removes any <s, >s, or &s?

I know there is a strip_tags function in php, but I would rather only remove the above characters in case the text wasn't intended to be an html tag, and apparently strip_tags can get confused by certain symbols...
 
Anyone in-the-know on this kinda thing?

:) wib
« Last Edit: August 29, 2010, 06:21:37 AM by WibblyWib »

Offline MrMxyzptlk

  • Posts Too Much
  • *****
  • Posts: 9208
  • Never backward,           always forward!
    • My 5th Dimensional Homepage
Re: Removing HTML tags
« Reply #1 on: August 29, 2010, 02:01:56 PM »

No, that would not be sufficient.  (And no, I have never specifically done that kind of thing before....)

It seems to me the simplest way would be to disallow any non-alphanumeric keystroke input other than the few semi-standard language ones. (E.g.: ! ( ) - : ; " ' , . ? )

This, of course, would only apply to English style languages tho.

Masking keystroke input is very easy and straightforward in any web/programming language.  Here's an in-line javascript example function that limits keyboard input to numbers:


Code: [Select]
<script type="text/javascript"><!--

function onlyNumbers(evt)
{
//alert("@onlyNumbers1");
    evt = (evt) ? evt : event;
    var charCode = (evt.charCode) ? evt.charCode : ((evt.keyCode) ? evt.keyCode :
           ((evt.which) ? evt.which : 0));
//alert("onlyNumbers2: charCode=" + charCode);
    if (charCode > 31 && (charCode < 48 || charCode > 57))
    {
//        alert("Please enter numerals only in this field.");
        return false;
    }
    return true;
}

 --></script>
(Note: You'll need to alter the code snippet above to include all the characters that you want to allow....)

It can be invoked in HTML like this:

Code: [Select]
<input type="text" onkeypress="return onlyNumbers(event)">
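For anyone adapting this: a keypress handler never sees pasted text, and it is skipped entirely by anything that posts to the server without using the form, so the same allow-list would also have to be applied to the whole submitted string. A minimal sketch of that whole-string check, assuming the character list suggested above; `isAllowedText` is a made-up name, not an existing function:

```javascript
// Hypothetical helper: true only if every character is an ASCII letter,
// a digit, a space, or one of the punctuation marks listed earlier
// ( ! ( ) - : ; " ' , . ? ). An empty string also passes.
function isAllowedText(text) {
    return /^[A-Za-z0-9 !()\-:;"',.?]*$/.test(text);
}

console.log(isAllowedText('Nice post, thanks!'));   // true
console.log(isAllowedText('<b>bold</b>'));          // false, contains < > /
```

The pattern is plain regular-expression syntax, so the same expression could be reused in a server-side check with little change.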



Offline BFM_Kiwi

  • Major
  • *
  • Posts: 9174
Re: Removing HTML tags
« Reply #2 on: August 29, 2010, 05:19:41 PM »
I have tried this before (not with php though), and I would recommend you NOT try to write something that looks at specific characters or patterns. If you are looking for something robust and safe, you will probably fail, because what you are basically trying to write is a small parser (or compiler), and that is not an easy thing to do. If someone is malicious, they'll probably be able to get around your attempt to block them.

Rather than removing the characters, the best thing to do is simply encode them, so that while they will still display to the user, they will not be interpreted as html characters.

Most languages like php will have an html encode/decode.  Or in javascript you use escape (to encode) and unescape (to convert back to html).

So you would just say

var encodedHTML = escape( user_input_string )


See this link.  It has some javascript code and a live demo of it where you can type html tags and it will encode (escape) your text safely.  

http://www.yuki-onna.co.uk/html/encode.html


Offline WibblyWib

  • Newbie Poster
  • *
  • Posts: 37
  • Make it so!
Re: Removing HTML tags
« Reply #3 on: August 30, 2010, 09:44:04 AM »
Thanks for your replies MrMxyzptlk, BFM_Kiwi.

Kiwi I don't quite follow what the end result of what you are suggesting would be.

Once the string is escaped, and looks like this 'hello%3Cbr%3Enew%20line', then what...?

I could then replace all instances of %xx with a SPACE, for example? (or indeed selecting specific symbols to replace)

Or....?

I don't see how the escape function enables you to display html characters without executing them.

Also, what characters are unsafe other than & < > " ' ? php functions such as htmlspecialchars only deal with these...
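On the escape question, the %xx form shown above is percent-encoding, which is meant for URLs. What lets a browser draw a character without treating it as markup is HTML entity encoding, where < becomes &lt; and so on. A minimal sketch of the difference, assuming only the five characters listed above need handling; `encodeHtml` is a made-up helper name:

```javascript
// Hypothetical helper: swap the five HTML-significant characters for entities.
function encodeHtml(text) {
    var entities = {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;'
    };
    return text.replace(/[&<>"']/g, function (ch) {
        return entities[ch];
    });
}

var input = 'hello<br>new line';

// Percent-encoding, intended for URLs: hello%3Cbr%3Enew%20line
console.log(encodeURIComponent(input));

// Entity encoding, intended for HTML text: hello&lt;br&gt;new line
console.log(encodeHtml(input));
```

Written into a div, the second string shows up on screen as the literal text hello<br>new line, with no line break inserted. In php, htmlspecialchars performs this kind of replacement, and its ENT_QUOTES flag makes it cover the single quote as well.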
« Last Edit: August 31, 2010, 05:18:24 AM by WibblyWib »

Offline WibblyWib

  • Newbie Poster
  • *
  • Posts: 37
  • Make it so!
Re: Removing HTML tags
« Reply #4 on: August 31, 2010, 03:20:52 PM »
Following further research, I wrote/converted a bit of php that converts each character to an HTML entity built from its ascii value, which can be printed as it is and will display nicely in a browser.

Seems like a good solution to me (though I'm no hacker... :hrmbig: ).

Here's the code (if anyone is interested!). Any characters which are not on the list get turned into SPACEs.

Code: [Select]
<?php

$char_array = array(); // lookup table: character => numeric HTML entity
$char_array['a'] = "&#97;";
$char_array['b'] = "&#98;";
$char_array['c'] = "&#99;";
$char_array['d'] = "&#100;";
// ...... here some more verboseness (letters and numbers)
$char_array[' '] = "&#32;";
$char_array['!'] = "&#33;";
$char_array['"'] = "&#34;";
$char_array['#'] = "&#35;";
$char_array['$'] = "&#36;";
$char_array['%'] = "&#37;";
$char_array['&'] = "&#38;";
$char_array["'"] = "&#39;";
$char_array['('] = "&#40;";
$char_array[')'] = "&#41;";
$char_array['*'] = "&#42;";
$char_array['+'] = "&#43;";
$char_array[','] = "&#44;";
$char_array['-'] = "&#45;";
$char_array['.'] = "&#46;";
$char_array['/'] = "&#47;";
$char_array[':'] = "&#58;";
$char_array[';'] = "&#59;";
$char_array['<'] = "&#60;";
$char_array['='] = "&#61;";
$char_array['>'] = "&#62;";
$char_array['?'] = "&#63;";
$char_array['@'] = "&#64;";
$char_array['['] = "&#91;";
$char_array['\\'] = "&#92;"; // double \\ required as a single one would escape the closing ' character. Tested - still works
$char_array[']'] = "&#93;";
$char_array['^'] = "&#94;";
$char_array['_'] = "&#95;";
$char_array['`'] = "&#96;";
$char_array['{'] = "&#123;";
$char_array['|'] = "&#124;";
$char_array['}'] = "&#125;";
$char_array['~'] = "&#126;";

$input_string = "this the string~~;#'';}{}.;.*^&^&$)(*\© "; //test input string

// work backwards so each replacement leaves the positions still to be visited untouched
for ($i = strlen($input_string) - 1; $i >= 0; $i--) {

    $char = substr($input_string, $i, 1); //get individual character (one byte)
    echo "<br>".$char.": "; //echo for posterity

    $valid = array_key_exists($char, $char_array); //check if char is in array

    if ($valid == false) { // i.e. not in the list of accepted chars
        $input_string = substr_replace($input_string, "&#32;", $i, 1); //replace char with SPACE
        echo "invalid";
    }
    else {
        $input_string = substr_replace($input_string, $char_array[$char], $i, 1); //replace char with its entity
        echo "valid";
    }

}

echo "<br>".$input_string;

?>
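The lookup table could probably be avoided, since every entry is just the character's own code wrapped in &# and ;. A rough sketch of the same idea in javascript, assuming "accepted" means printable ASCII (codes 32 to 126) as the table seems to intend; `toNumericEntities` is a made-up name:

```javascript
// Hypothetical helper: encode every character as a numeric character reference.
function toNumericEntities(text) {
    var out = '';
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        // Printable ASCII keeps its own code; anything else becomes a space (32).
        var safeCode = (code >= 32 && code <= 126) ? code : 32;
        out += '&#' + safeCode + ';';
    }
    return out;
}

console.log(toNumericEntities('a<b')); // &#97;&#60;&#98;
```

Building a new string also removes the need to loop backwards. One difference from the php version: javascript strings are indexed by UTF-16 code units rather than bytes, so a character like © turns into one space here instead of one space per byte.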
